DOCUMENT RESUME

ED 268 157                                              TM 860 179

AUTHOR          Scheuneman, Janice
TITLE           Exploration of Causes of Bias in Test Items.
INSTITUTION     Educational Testing Service, Princeton, NJ. Graduate
                Record Examination Board Program.
SPONS AGENCY    Graduate Record Examinations Board, Princeton,
                N.J.
REPORT NO       ETS-RR-85-42; GREB-81-21P
PUB DATE        Dec 85
NOTE            78p.
PUB TYPE        Reports - Research/Technical (143)
EDRS PRICE      MF01/PC04 Plus Postage.
DESCRIPTORS     Black Students; *College Entrance Examinations;
                Graduate Study; Higher Education; *Item Analysis;
                Latent Trait Theory; Multiple Choice Tests;
                *Performance Factors; Racial Differences; *Racial
                Factors; Research Design; Scores; *Test Bias; Test
                Format; Testing Problems; Test Items; White
                Students
IDENTIFIERS     *Graduate Record Examinations



ABSTRACT 

A number of hypotheses were tested concerning
elements of Graduate Record Examinations (GRE) items that might
affect the performance of blacks and whites differently. These
elements were characteristics common to several items that otherwise
measured different concepts. Seven general hypotheses were tested in
the form of sixteen specific hypotheses, most with five or six items.
Items were developed in pairs, some with and others without the
hypothesized element, and were administered within the GRE
analytical, verbal, and quantitative sections. Log linear analyses of
the sixteen hypotheses showed interactions between group membership
and item version, indicating a differential effect on black and white
examinees' performance. A latent trait approach was also used to
examine individual item bias for eight of the hypotheses. The
manipulations included context clues, item format, vocabulary,
selecting the false rather than the true answer in a multiple choice
item, understanding the item writer's inference, test wiseness,
location of the correct response, and ambiguity versus structure. It
was concluded that the item manipulations did have differential
effects on blacks' and whites' performance. However, the complexity
of the effects suggested that other uncontrolled factors affecting
performance were present. (Author/GDC)

Reproductions supplied by EDRS are the best that can be made
from the original document.






GRADUATE RECORD EXAMINATIONS

EXPLORATION OF CAUSES
OF BIAS IN TEST ITEMS

Janice Scheuneman

GRE Board Professional Report GREB No. 81-21P
ETS Research Report 85-42

December 1985

This report presents the findings of a
research project funded by and carried
out under the auspices of the Graduate
Record Examinations Board.

EDUCATIONAL TESTING SERVICE, PRINCETON, NJ






Copyright (c) 1985 by Educational Testing Service 
All rights reserved 



Acknowledgments 



My thanks to my colleagues at ETS who prepared the test items for this
study and without whom the study could never have been done. They are:

Verbal:        Mari Pearlman
               Edward Shea

Quantitative:  Patricia Kuntz Klag
               Beverly Whittington

Analytical:    Clark Chalifour
               Richard Evert
               Elizabeth McGrail
               Erich Woisetschlaeger

Carole Slaughter and Paul Ramsey also participated in the planning
phases and in review of the items following analysis. Paul Holland
provided advice and counsel on the log linear analyses. The item bias
analyses were planned and executed by David Thissen and Lynne Steinberg
at the University of Kansas in consultation with Howard Wainer of
Educational Testing Service.






Abstract 



Although the study of item bias has been an active area of 
research for the past few years, little progress has been made in 
understanding what factors in a test might create bias. To advance 
this understanding, a number of hypotheses were developed concerning 
elements of the items that might affect the performance of Blacks and 
Whites differently. The elements investigated were characteristics 
that would be common to several items that otherwise measured different 
concepts or content. 

Seven general hypotheses were developed and tested in the form of 
16 specific hypotheses, most with five or six items. For purposes of 
this study, items were developed in pairs, with a hypothesized element
present in one item and missing or modified in the other. The item 
pairs were assembled into two separate sections and administered as 
part of the Graduate Record Examinations General Test. Pairs of test 
sections were developed for each of the three areas tested in the 
General Test: verbal, quantitative, and analytical ability. These 
sections were spiralled for administration to yield randomly equivalent 
groups of examinees. 

The 16 specific hypotheses were evaluated separately using log 
linear analyses. For 10 of the 16 hypotheses, the results showed
interactions between group membership and item version, indicating a
differential effect of the manipulation on the performance of Black and
White examinees. Eight of these hypotheses were also analyzed for item
bias at the individual item level, using a procedure based on item 
response theory. 

The conclusion was that item manipulations of the type
hypothesized did have a differential effect on the performance of Black
and White examinees. The complexity of the effects, however, suggested
that other uncontrolled factors affecting performance were also
operating. A number of suggestions for future research based on these
results were provided.






To date, research on test bias has done little to cast light on
the possible sources of bias in testing. If, however, bias is in fact
a factor in tests that works to inflate score differences between two
population groups, some understanding of the causes of bias is
essential if we are, one day, to take effective remedial action. A
theory of bias now being developed provides a perspective broader than
the perspectives suggested by previous research, which has tended to
focus on other bias issues. From the perspective of this bias theory,
questions concerning sources or causes of bias can be more readily
formulated and addressed. (Details of the bias theory and its relation
to the results of test and item bias research are provided by
Scheuneman, 1981, 1984.)

In this theory, bias is defined as a multifaceted component of
observed test scores that, like the usual measurement error, causes an
observed score to be different from the "true" score, but, unlike the
usual measurement error, is associated with membership in a particular
group and, for members of that group, has an expected value other than
zero. That is, a biased score is a systematic over- or underestimate
of ability for members of the group. This bias is postulated to stem
from two major sources: individual difference characteristics
of examinees, which are differently distributed within groups, and
characteristics of tests or test items, which have different effects
for persons in different groups.
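
The definition above can be written compactly. The notation below is a sketch introduced here for illustration only (the symbols $X$, $T$, $B$, $E$, and $\beta_g$ are assumptions, not the notation of the bias theory itself):

```latex
\begin{align}
X_{ig} &= T_{ig} + B_{ig} + E_{ig}, \\
\mathbb{E}\left[E_{ig}\right] &= 0, \\
\mathbb{E}\left[B_{ig} \mid g\right] &= \beta_g, \qquad \beta_g \neq 0
  \ \text{for a group whose scores are biased.}
\end{align}
```

Here $X_{ig}$ is the observed score of examinee $i$ in group $g$, $T_{ig}$ the "true" score, $E_{ig}$ the usual measurement error, and $B_{ig}$ the bias component. A positive $\beta_g$ would correspond to a systematic overestimate of ability for members of group $g$, and a negative $\beta_g$ to a systematic underestimate.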

Of primary interest in this study are sources of bias that lie in
characteristics of test items. In this respect, the study relates to
item bias research, where the identification of items that contain such
characteristics is clearly an objective. The item bias methodologies,
however, have focused on detecting "outliers," items that are
functioning most differently for two groups. Perhaps this is why the
interpretation, the attribution of possible causes of the obtained result,
has also tended to focus on the peculiarities of the items identified
rather than on commonalities that may exist among them. The bias
theory from which this study is derived argues that the causes of bias
that are likely to be important, in the sense of a greater distortion
of score differences between groups, are those that lie in
characteristics that are common to several items in a test.

According to the theory, bias may even exist in tests where no
outliers are found. For example, if all the items in a test use a
single format and that format is more familiar to the members of one
group than to those of another, the test scores for the two groups may
suggest that the differences in ability are larger than they really
are. That is, the test would be biased. Further, several such
characteristics may be operating in a single test. Each such
characteristic may, by itself, have relatively little impact, but the
cumulative effect of several such characteristics may be significant.
Items detected as outliers in statistical studies of item bias may,
therefore, be those where other item features interact with the biasing
characteristic to produce an unusually large effect or two or more
biasing characteristics may be operating in the same item. If so, such
effects would be unlikely to be readily discernible by an investigator,
particularly if items were examined in isolation, perhaps explaining
why item bias results have so frequently been found to be
uninterpretable.
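
The arithmetic behind this cumulative argument can be illustrated with a short sketch. All numbers below are hypothetical (a 40-item test, a probability of .60 of answering each item correctly, and a format effect of .02 per item); none of them comes from the study.

```python
def expected_score(p_correct):
    """Expected number-right score: the sum of the per-item probabilities."""
    return sum(p_correct)

n_items = 40                       # assumed test length
p_reference = [0.60] * n_items     # assumed probability correct without any format effect
format_effect = 0.02               # assumed small per-item disadvantage from an unfamiliar format
p_affected = [p - format_effect for p in p_reference]

# A per-item effect too small to flag any single item as an outlier
# still accumulates over the whole test.
score_gap = expected_score(p_reference) - expected_score(p_affected)
print(f"expected score gap: {score_gap:.2f} points")
```

Under these assumptions no single item would stand out, yet the expected total scores differ by the per-item effect multiplied by the number of items sharing the characteristic.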

If these more general but subtler effects exist, they should be
detectable. The thesis of this work, therefore, is that
characteristics of items can be identified that are not peculiar to single items
and that have a differential effect on the performance of Black and
White examinees. In evaluating this thesis, however, a practical
trade-off was necessary. Prior to this work, little but speculation existed
as to what such characteristics might be. A choice needed to be made
with regard to how many hypotheses concerning the nature of these
characteristics could be evaluated within the scope of this study. If
few were evaluated, more items could be developed for each and hence
more power would be available for identifying effects expected to be
subtle. As a first effort in this direction, however, a decision was
made to look at several possible hypotheses to gain more breadth of
results at the risk of less certainty about the outcome. In this
respect, this study should be seen as exploratory. The major questions
to be addressed here are two: Can such effects be demonstrated in at
least some instances, and what avenues might fruitfully be explored in
more depth and with better control in the future?

Method 

To evaluate the effects of the hypothesized characteristics, items
were constructed in pairs with the characteristic or item feature
present in one item and absent or modified in the other. Items were then
assembled into different test books and administered as part of the GRE
General Test. Test books were spiralled for administration, yielding
randomly equivalent samples of Black and White examinees for each item
of a pair. The differences in performance on items with similar
characteristics were then evaluated to determine if the effect of the item
manipulation was the same for both groups.
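
The comparison described above amounts to asking whether the change in performance from one item version to the other is the same in both groups. A minimal sketch for a single item pair follows; it is a simplified stand-in for the log linear analyses used in the study, and the counts are hypothetical. For a group-by-version-by-response table, the group-by-version interaction on the log-odds scale is the difference between the two groups' version effects.

```python
import math

def log_odds(correct, total):
    """Log odds of a correct response; assumes both cells are nonzero."""
    return math.log(correct / (total - correct))

def version_effect(counts):
    """Change in log odds of a correct response from version A to version B."""
    return log_odds(*counts["B"]) - log_odds(*counts["A"])

# Hypothetical (number correct, number tested) per version; not data from the study.
black = {"A": (120, 280), "B": (150, 275)}
white = {"A": (1100, 1947), "B": (1180, 1929)}

# A value near zero means the manipulation affected both groups alike.
interaction = version_effect(black) - version_effect(white)
print(f"version effect, Black examinees: {version_effect(black):.3f}")
print(f"version effect, White examinees: {version_effect(white):.3f}")
print(f"group-by-version interaction:    {interaction:.3f}")
```

A full analysis would also attach a significance test to the interaction and pool over the items belonging to one hypothesis; the sketch shows only the quantity being tested.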

The Test 

The GRE General Test consists of seven separately timed sections.
The operational portion of the test, from which examinees' scores are
obtained and reported, consists of two sections each of verbal,
quantitative, and analytical ability items. The seventh section (which may
or may not be in the seventh position in the test book) serves one of
several purposes for a given administration. It may be used for
equating, pretesting new items, or various experimental purposes.






The verbal sections of the test are made up of four item types:
sentence completions, antonyms, analogies, and reading comprehension
sets. The quantitative sections involve concepts from arithmetic,
algebra, and geometry. In addition to the usual math-type item format,
these sections include data interpretation items and quantitative
comparison items. Quantitative comparisons require the examinee to
compare the quantities given in two columns and to decide whether one
quantity is greater than the other, whether the two quantities are
equal, or whether the relationship cannot be determined from the
information given. The analytical ability sections consist of two types
of questions, analytical reasoning and logical reasoning. Analytical
reasoning items focus on the ability to analyze a given structure of
arbitrary relationships and to deduce new information from that
structure. Logical reasoning items focus on the ability to understand and
analyze relationships among arguments or parts of an argument.

The Subjects 

The experimental items were administered in six different forms of
the variable section of the GRE General Test. Forms were spiralled for
administration, yielding random samples of examinees from the December
1982 Saturday administration of the test. Ethnic background was
self-identified on registration material submitted prior to the testing and
matched later with the score records.
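
Spiralling is taken here to mean that the forms are distributed in a fixed rotating order, so that consecutive examinees receive different forms and each form reaches a comparable cross-section of examinees. The sketch below illustrates that idea only; the form labels and the seating order are hypothetical, and the actual packaging procedure is not described in this report.

```python
from collections import Counter
from itertools import cycle

def spiral_forms(examinee_ids, forms):
    """Assign forms in a repeating sequence, one per examinee in order."""
    return {examinee: form for examinee, form in zip(examinee_ids, cycle(forms))}

# Hypothetical labels for the six forms (two each for verbal, quantitative, analytical).
forms = ["V1", "V2", "Q1", "Q2", "A1", "A2"]
examinees = [f"examinee_{i:04d}" for i in range(1200)]  # hypothetical seating order

assignment = spiral_forms(examinees, forms)
print(Counter(assignment.values()))
```

Because form membership depends only on position in the sequence, the groups taking each form differ only by chance, which is what allows the two versions of an item pair to be compared across different examinees.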

The Black and White examinees from this administration were found
to differ in background characteristics other than racial or ethnic
group. In particular, the Black examinees were more often female and
had different concentrations in areas of academic preparation than
White examinees. Since performance is known to be different in at
least some instances for men and women and for students from different
academic backgrounds, these differences between the Black and White
groups could cause difficulties in attributing differences to racial or
ethnic group membership, although clearly a number of other background
factors had been left uncontrolled. Analyses were hence performed
using all examinees who identified themselves as Black and a sample of
Whites selected to be similar to the Blacks in these background
characteristics.

The White sample was selected randomly within categories defined 
by sex and by broad major field of undergraduate work: physical 
sciences, biological sciences, humanities, and social sciences. A 
spaced sample of two of every three men was taken first and pooled with 
all women for each form. Spaced samples were then taken to select one 
of three humanities majors, one of two biological science majors, and 
one of three physical science majors. All social science majors were 
retained. Characteristics of the samples for each of the three pairs 
of tests are given in Tables 1 and 2. Differences between the
proportion of males and females and among the proportions in each major
field were not significant for the four samples (Black samples and 
White samples for each of the two tests) for any of the three content 
areas. 
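
The selection steps for the White sample can be sketched as follows. The records are hypothetical, and the report does not say which members of each run of two or three were retained; keeping the first ones is an assumption made here.

```python
def spaced_sample(records, keep, out_of):
    """Keep the first `keep` of every `out_of` consecutive records (a spaced sample)."""
    return [r for i, r in enumerate(records) if i % out_of < keep]

# Hypothetical White examinee records for one form: (id, sex, major field).
majors = ["humanities", "social", "biological", "physical"]
white = [(i, "M" if (i // 4) % 2 else "F", majors[i % 4]) for i in range(1200)]

# Step 1: two of every three men, pooled with all women.
men = [r for r in white if r[1] == "M"]
women = [r for r in white if r[1] == "F"]
pooled = spaced_sample(men, 2, 3) + women

# Step 2: thin each major field at the rate given in the text; social sciences are all retained.
rates = {"humanities": (1, 3), "biological": (1, 2), "physical": (1, 3), "social": (1, 1)}
sample = []
for major, (keep, out_of) in rates.items():
    sample += spaced_sample([r for r in pooled if r[2] == major], keep, out_of)

print(len(white), len(pooled), len(sample))
```

The sampling fractions lower the share of men and of humanities, biological science, and physical science majors among the Whites, which moves the White sample toward the composition of the Black group.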

























Table 1

Number of Examinees by Race and Academic Major Subject
(All Black Examinees/White Sample)

                                 Verbal       Quantitative     Analytical
        Major                 Black  White    Black  White    Black  White

Test 1  Humanities               27    187       31    162       32    179
        Social Sciences         178   1219      178   1220      154   1188
        Biological Sciences      51    353       60    366       50    322
        Physical Sciences        24    188       33    162       36    213
        Total                   280   1947      302   1910      272   1902

        χ² Between Groups          .34             3.79            3.83
        Probability                .95              .28             .28

Test 2  Humanities               30    184       30    196       35    181
        Social Sciences         160   1233      188   1248      170   1208
        Biological Sciences      51    343       56    342       50    356
        Physical Sciences        34    169       33    157       35    189
        Total                   275   1929      307   1943      290   1934

        χ² Between Groups         5.19             2.70            4.02
        Probability                .16              .44             .26

        χ² Between Tests       2.80   1.18     2.70   4.16      .36   3.04
        Probability             .42    .76      .44    .25      .95    .38
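
The between-groups statistics in Table 1 can be checked from the counts. The sketch below assumes an ordinary Pearson chi-square test of homogeneity with 3 degrees of freedom (four major fields, two groups); the report does not state how its values were computed. It uses the Test 1 analytical columns.

```python
import math

def chi_square_homogeneity(row_a, row_b):
    """Pearson chi-square for a 2 x k table of counts."""
    total_a, total_b = sum(row_a), sum(row_b)
    grand = total_a + total_b
    chi2 = 0.0
    for a, b in zip(row_a, row_b):
        column = a + b
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * column / grand
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def chi2_upper_tail_df3(x):
    """Upper-tail probability of a chi-square variate with 3 degrees of freedom (closed form)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# Test 1, analytical forms: humanities, social, biological, physical sciences.
black = [32, 154, 50, 36]
white = [179, 1188, 322, 213]

chi2 = chi_square_homogeneity(black, white)
p = chi2_upper_tail_df3(chi2)
print(f"chi-square = {chi2:.2f}, probability = {p:.2f}")
```

Under these assumptions the result should agree closely with the tabled 3.83 and .28 for that column, indicating no significant difference in the distribution of major fields between the two groups.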













































Table 2

Number of Examinees by Race and Sex
(All Black Examinees/White Sample)

                                 Verbal       Quantitative     Analytical
        Sex                   Black  White    Black  White    Black  White

Test 1  Male                     92    653      103    610       93    669
        Female                  188   1294      199   1300      179   1233
        Total                   280   1947      302   1910      272   1902

        χ² Between Groups          .05              .56             .10
        Probability                .82              .45             .75

Test 2  Male                     82    621      106    626       97    668
        Female                  193   1308      201   1317      193   1266
        Total                   275   1929      307   1943      290   1934

        χ² Between Groups          .62              .64             .13
        Probability                .43              .42             .72

        χ² Between Tests        .60    .80      .03    .64      .03    .17
        Probability             .44    .37      .85    .42      .85    .68






Development of the Experimental Items

A series of general hypotheses was initially generated that
suggested possible sources of bias in test items of the type appearing
in the GRE General Test. These hypotheses were based on previous work
(Scheuneman, 1979, 1982) and on suggestions from ETS staff and GRE
Board advisory committees. Meetings were then held with the respective
ETS test development staff members who would be preparing the items for
each of the three ability areas measured by the test: verbal,
quantitative, and analytical. At each meeting, these hypotheses were
discussed to determine which ones might be evaluated within that
particular area and the types of items that might be used in each case.
Out of this process, 16 specific hypotheses were selected that could be
subsumed by seven more general hypotheses. These hypotheses are
described in the following section.

For each of the hypotheses, items were developed in pairs that
were kept as similar as possible except for the hypothesized factor.
Items were then arranged into two test forms with one item from each
pair appearing in a particular form with the same serial position in
each test. Two paired forms were developed in this way for each of the
three ability areas. For quantitative and analytical tests all the
items for a given version (A or B) of a hypothesis appeared in the same
test book. Except for Hypothesis 1.1, some items from both versions of
the verbal hypotheses appeared in each test book.

The Hypotheses 

Throughout this report, the discussion will be clarified if a few
terms are defined here. Each item pair occurs in the same serial
position in a test section. Hence, verbal "Item 1" refers to the pair
of items that appear in the first position in the two test sections
with verbal items. Each "item" thus has two versions. These two
versions differ according to some manipulation of the item elements
dictated by the expectations of a given hypothesis. Although the two
versions of an item pair were kept as similar as possible, in most
instances other item elements also changed as a necessary consequence
of the intended manipulation if the items were to appear sensible and
to function properly. Such "unintentional" changes were probably the
source of much of the noise that became apparent in the results.

In the following section, a brief description is provided of each
hypothesis and the two item versions. A summary of the hypotheses and
versions is provided in Table 3 for reference in reading this report.
Table 4 shows the content area (verbal, quantitative, or analytical) in
which a hypothesis was evaluated and the number of items in each of the
hypotheses and areas.






Table 3
Summary of Hypotheses

Hypothesis 1.0

Item format affects the performance of Blacks more than that of Whites.

1.1A  Vocabulary out of context (antonyms)
1.1B  Vocabulary in context (sentence completion)

1.2A  Quantitative comparisons
1.2B  Standard multiple-choice

1.3A  Standard multiple-choice
1.3B  Roman numeral format

Hypothesis 2.0

Blacks are more apt than Whites to be misled by vocabulary items that
depend on using secondary meanings or altering word meaning with
prefixes or suffixes to produce item difficulty.

2.1A  Less common word meaning
2.1B  More common word meaning

2.2A  Suffix/prefix with less common meaning
2.2B  Suffix/prefix with more common meaning

Hypothesis 3.0

Items that ask for the one false answer rather than the one true answer
are more apt to be confusing to Blacks.

3.1A  One false answer
3.1B  One true answer

3.2A  One true answer
3.2B  One false answer

Hypothesis 4.0

Performance of Blacks is more likely to be affected by items calling
for inferences, since such items are more apt to require inferring the
intent of the item writer.

4.1A  Inference from reading passage required
4.1B  Material directly stated in passage

4.2A  Implies most likely response is required
4.2B  States most likely response is required

4.3A  Implies all possible responses
4.3B  States "a complete and accurate list"

Hypothesis 5.0

White examinees are more likely than Blacks to capitalize on the
information often provided unintentionally in test items.

5.1A  Test-wiseness cues absent
5.1B  Test-wiseness cues present

5.2A  Test-wiseness cues on distractor
5.2B  Test-wiseness cues on key

Hypothesis 6.0

Blacks and Whites will be differently affected by the placement of the
key among the distractors.

6.1A  Strongest distractor before key
6.1B  Key before strongest distractor

6.2A  Key in center of option sequence
6.2B  Key not in center of sequence

Hypothesis 7.0

Ambiguity in a task is more apt to be misleading to Blacks. Hence,
Blacks will tend to perform better when the task is structured in
concrete terms.

7.1A  Use numbers in problem
7.1B  Use symbols in problem

7.2A  Use diagrams
7.2B  Use verbal descriptions






Table 4

Number of Hypotheses and Items Tested for Each Content Area

Content        Hypothesis                            Items per            Items per
Area           Number                                Hypothesis           Area

Verbal         1.1, 2.1, 2.2, 3.1, 4.1, 5.1, 6.1     5, 7, 5, 5, 5, 5     42

Quantitative   1.2, 5.2, 6.2, 7.1, 7.2               6, 6, 6, 6, 6        30

Analytical     1.3, 3.2, 4.2, 4.3                    5, 5, 5, 5           20






Hypothesis 1.0. Hypothesis 1.0 concerns the way in which the task
required by the item is presented. Five items were developed for each
of three specific hypotheses.

Hypothesis 1.1 contrasts antonym and sentence completion items.
Since Black examinees as a group have frequently been demonstrated to
have less well developed vocabularies, it was expected that the context
provided by the sentence completion items would help compensate for
this difference. Sentence completion items were expected to be easier
than the antonym items and the difference in difficulty was expected to
be larger for Blacks than for Whites. The stimulus word for the
antonym versions was the key term in the sentence completion items.
The stimulus words were arbitrary, dichotomy, naivete, ingenuity, and
diaphanous. This last term comes from a disclosed item and is used in
Example 1.*

Example 1
Hypothesis 1.1

Version A

DIAPHANOUS:

 (A) inherent
*(B) substantial
 (C) barbarous
 (D) obsolete
 (E) repetitive

Version B

By its very strength and sharpness the sunlight of Greece
forbids the shifting, melting, -------- effects which give so
delicate a charm to the French or Italian scene.

 (A) insipid   (B) barbarous
 (C) sustaining   *(D) diaphanous
 (E) ostentatious

* Note that unless otherwise indicated, examples were developed
specifically for this paper, since one version of the items used in the
study remains in the active item pool in most instances and hence is
still secure.







Hypothesis 1.2 contrasts two quantitative item types: quantitative
comparisons and the standard problem format. It was thought that the
less familiar quantitative comparisons items might be more difficult
for Blacks and cause their performance to differ more than that of
Whites. The problem posed was the same in both versions, but the form
in which the answer choices were presented differed. See Example 2.



Example 2
Hypothesis 1.2

Version A

Column A                    Column B

A if the quantity in Column A is greater;
B if the quantity in Column B is greater;
C if the two quantities are equal;
D if the relationship cannot be
  determined from the information given.

Version B

In the figure above, x =

 (A) 10   (B) 12.5
*(C) 20   (D) 35   (E) 40






Hypothesis 1.3 concerned the way in which the options were presented.
Analytical items were developed where the situational set was the same for
both versions, but the options were presented in either a standard set of
multiple-choice options or a "Roman numeral" format. In the latter, some
number of statements is made (usually three), each designated with a Roman
numeral. In response to a question concerning which of these are true, the
options would be of the form "I only," "I and II," etc. Again, it was
expected that the standard format would be comparatively easier for Blacks.
Example 3 illustrates this hypothesis.



Example 3
Hypothesis 1.3

All M's are Q's.
All F's are Q's.
All Q's are F's or M's, but not both.
Some Q's are Y's.
Not all Y's are Q's.
All W's are Y's.

Version A

If a W is an M, it must also be

 (A) a Y only
 (B) a Q only
 (C) an F only
 (D) an F and a Q
*(E) a Y and a Q

Version B

If a W is an M, which of the
following must a W also be?

  I. Y
 II. Q
III. F

 (A) I only
 (B) II only
 (C) III only
*(D) I and II
 (E) II and III
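
The keyed answers in Example 3 can be confirmed by enumerating every combination of memberships that one individual could hold. The sketch below is an illustration added for the reader, not part of the item; it encodes only the four universal statements, since "Some Q's are Y's" and "Not all Y's are Q's" describe the population as a whole and place no constraint on a single individual.

```python
from itertools import product

KINDS = ("M", "F", "Q", "Y", "W")

def consistent(x):
    """True if one individual's memberships obey the universal rules of the passage."""
    return (
        (not x["M"] or x["Q"])                  # All M's are Q's.
        and (not x["F"] or x["Q"])              # All F's are Q's.
        and (not x["Q"] or x["F"] != x["M"])    # All Q's are F's or M's, but not both.
        and (not x["W"] or x["Y"])              # All W's are Y's.
    )

# Every consistent individual that is both a W and an M.
candidates = [
    x
    for x in (dict(zip(KINDS, bits)) for bits in product([False, True], repeat=5))
    if consistent(x) and x["W"] and x["M"]
]

must_be = {k for k in KINDS if all(x[k] for x in candidates)} - {"W", "M"}
cannot_be = {k for k in KINDS if not any(x[k] for x in candidates)}
print(sorted(must_be), sorted(cannot_be))
```

The enumeration leaves Y and Q as the memberships a W that is an M must also hold, and rules out F, which corresponds to option (E) in Version A and option (D) in Version B.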






Hypothesis 2.0. Again, on the assumption that, as a group, Blacks
have less extensive vocabularies, the hypothesis was that manipulation
of vocabulary in antonym items to increase the difficulty would have
a greater impact on the performance of Black examinees than on that of
White examinees.

Hypothesis 2.1 concerned the effect of increasing the vocabulary
level of the term intended to be the key in order to make the item more
difficult. Seven items were developed using the following words as
stimuli: accretion, conventional, sap, grave, requisite, curb, and
dampen. The first of these was disclosed and this item pair is shown
as Example 4. For each version, different words were provided as the
correct response, though generally the distractors remained the same.



Example 4
Hypothesis 2.1

Version A

Accretion:

 (A) disinterest
 (B) disagreement
 (C) exactitude
 (D) adhesion
*(E) attrition

Version B

Accretion:

 (A) disinterest
 (B) disagreement
 (C) exactitude
 (D) adhesion
*(E) diminution



In Hypothesis 2.2, the item difficulty was altered through the
addition, deletion, or modification of prefixes or suffixes. The
following were the stimulus words with the alternate version in
parentheses: (un)gainly, (un)impeachable, (im)palpable,
(in)determinate, and guile(ful/less). Example 5 illustrates this
hypothesis.



Example 5
Hypothesis 2.2

Version A

Ineffectual:

*(A) powerful
 (B) energetic
 (C) knowledgeable
 (D) economical
 (E) skillful

Version B

Effectual:

*(A) powerless
 (B) pathetic
 (C) ignorant
 (D) costly
 (E) awkward






Hypothesis 3.0. The standard multiple-choice item essentially
asks the question, "Which of the following options is true?" A common
variant reverses this by stating that all of the following are true
EXCEPT... (or by implication, "Which of the following is false?"). Such
items have often been identified as problematic in previous bias
studies. It was hypothesized here that the impact of this negative
phrasing would be greater on the performance of Black examinees. Five
item pairs of each type were developed, five with verbal items
(Hypothesis 3.1) and five with analytical items (Hypothesis 3.2). The
passage to which the items referred was the same for an item pair while
the statements of the question (which of the following is true or
false) and the alternative statements constituted the two item
versions. A single example is adequate for Hypotheses 3.1 and 3.2
since the difference between these is in the type of content, verbal or
analytical, not the form. Example 6, an analytical item, is
illustrative.

Example 6
Hypotheses 3.1 and 3.2

Offshore blasting in oil exploration does not hurt
fishing; blasting started this year, and this year's
salmon catch has been the largest in a long time.

Version A

All of the following statements,
if true, are valid objections
to the argument above EXCEPT:

 (A) The salmon is only one of
     many species of fish that
     might be affected by the
     blasts.

 (B) The rapid changes of water
     pressure caused by the
     blasts make salmon mate more
     frequently.

 (C) The noise of the blasts
     interferes with the feeding
     habits of salmon.

 (D) Vibrations from the blasts
     destroy fish eggs.

 (E) Factors that have nothing to do
     with the well-being of salmon
     may significantly affect the size
     of one year's catch.

Version B

Which of the following statements,
if true, is a valid
objection to the argument
above?

*(A) The salmon is only one of
     many species of fish that
     might be affected by the
     blasts.

 (B) The rapid changes of water
     pressure caused by the
     blasts make salmon mate more
     frequently.

 (C) Salmon are particularly
     vulnerable to the effects of
     underwater blasting.

 (D) Salmon spawn in fresh water
     rather than the ocean.

 (E) Oil exploration is essential
     to the nation's economy.






Hypothesis 4.0. To the extent that the task demanded by the item
is ambiguously stated or is open to interpretation, White examinees are
hypothesized to be better able to infer the intent of the item writer
and less apt to be confused or misled. Consequently, Black examinees
were expected to be influenced more by different statements concerning
the task requirements. Five item pairs were developed for each of
three specific hypotheses.

Hypothesis 4.1 deals directly with inference, assuming that
material not stated may be less accessible to Black than to White
examinees. Here the two items of the pair were basically the same.
The different versions were reflected by whether or not the material to
be inferred was directly stated in the passage with a corresponding
change in the statement of the question. Illustrations are provided in
Example 7.



Example 7
Hypothesis 4.1

Version A

There is little agreement concerning the way in which kinds of
avalanches should be classified. Some classification systems
depend on the kind of snow involved, others are concerned with the
type of movement, and one scheme includes both, as well as several

(5) other criteria. Existing descriptive terms, most of them German,
are deeply rooted in avalanche parlance: they are expressive, but
they are often untranslatable into other languages and lack
precision in their own. Furthermore, as Dr. Quervain has pointed
out,* avalanches are not only concrete objects capable of being

(10) photographed; they are also events. As events, they include, for
example, the development of the avalanche through the influences
of weather; the incident that starts the snow moving; and the type
of movement. The description of the avalanche as an object

(15) includes information about the depth, physical consistency, and
stratification of the snow, the features of the terrain, and the
type and the dimension of the break.

It can be inferred that the author mentions Dr. Quervain's
observations (lines 8-10) primarily in order to

(A) indicate the temporary nature of an avalanche

(B) illustrate the imprecision of the terms used in avalanche
classification

(C) introduce his own avalanche classification system

*(D) further emphasize the complexity of avalanche classification

(E) further explain why it is important to classify kinds of
avalanches

*Underline shows the section of the passage that differs between
versions. The underline did not appear in the original item.







Version B



There is little agreement concerning the way in which kinds of 
avalanches should be classified. Some classification systems 
depend on the kind of snow involved, others are concerned with the 
type of movement, and one scheme includes both, as well as several 

(5) other criteria. Existing descriptive terms, most of them German, 
are deeply rooted in avalanche parlance: they are expressive, but 
they are often untranslatable into other languages and lack 
precision in their own. Other difficulties in establishing
classification schemes for avalanches are illustrated by

(10) Dr. Quervain when he points out that avalanches are not only 
concrete objects capable of being photographed, they are also 
events. As events, they include, for example, the development of 
the avalanche through the influences of weather; the incident that 
starts the snow moving; and the type of movement. The description 

(15) of the avalanche as an object includes information about the 

depth, physical consistency, and stratification of the snow, the 
features of the terrain, and the type and the dimension of the 
break. 

According to the passage, Dr. Quervain's observation (lines 8-12) is
intended primarily to 

(A) indicate the temporary nature of an avalanche 

(B) illustrate the imprecision of the terms used in avalanche
classification 

(C) introduce his own avalanche classification system

*(D) further emphasize the complexity of avalanche classification

(E) further explain why it is important to classify kinds of 
avalanches 







Hypotheses 4.2 and 4.3 concerned analytical items. In both cases
the item pairs were identical except for a change in the wording of the
question asked. For Hypothesis 4.2, one version asked which is the
"best" or "most" likely response; in the alternate version the "best"
or "most" was deleted. In Hypothesis 4.3, the alternate wordings were
similar to "Which of the following could be true?" and "Which of the
following is a complete and accurate list of what could be true?"
Illustrations are shown in Example 8.

Example 8 

Hypothesis 4. 2 

It is clear from trends in the 1970's that economic classes
in the United States have been growing farther apart from
one another rather than becoming more nearly equal. The
weekly spendable earnings of private-sector nonsupervisory
workers have declined by 10 percent since 1970, and by
1979 had returned to their level of 1964. The claim of equality
in the United States is more and more difficult to sustain.



Version A 

Which of the following, if true, 
would support the argument of 
the passage above? 

(A) The rate of unemployment
fluctuated greatly in the
1970's.

(B) Public-sector workers also
suffered a decline in their
spendable income in the
1970's.

(C) The spendable income of
workers declined in the
1950's.

*(D) Supervisors and owners main-
tained the level of their
spendable income in the
1970's.

(E) By 1964 there was already a
sizeable gap between economic
classes in the United States.



Version B 

Which of the following, if true, 
would best support the argument of 
the passage above? 

(A) The rate of unemployment
fluctuated greatly in the
1970's.

(B) Public-sector workers also
suffered a decline in their
spendable income in the
1970's.

(C) The spendable income of
workers declined in the
1950's.

*(D) Supervisors and owners main-
tained the level of their
spendable income in the
1970's.

(E) By 1964 there was already a
sizeable gap between economic
classes in the United States.






Hypothesis 4.3 



Seven bottles of chemicals are arranged on a shelf
in seven spaces numbered 1 through 7 consecutively 
from left to right. Each bottle occupies one 
space. Three bottles are filled with sulfate, two 
are filled with hydroxide, and two are filled with 
chloride. 

No bottle of sulfate is next to another 
bottle of sulfate. 

None of the bottles of sulfate is in 
space 3. 

Neither bottle of chloride is in space 5. 
Version A

If the two bottles of chloride are next to each other and the two
bottles of hydroxide are next to each other, which of the
following chemicals could occupy space 2?

*(A) Only chloride
(B) Only hydroxide
(C) Either chloride or hydroxide
(D) Either chloride or sulfate
(E) Either hydroxide or sulfate

Version B

If the two bottles of chloride are next to each other and the two
bottles of hydroxide are next to each other, which of the
following is a complete and accurate list of the chemicals
that could possibly occupy space 2?

*(A) Chloride
(B) Hydroxide
(C) Chloride, hydroxide
(D) Chloride, sulfate
(E) Hydroxide, sulfate






Hypothesis 5.0. Just as White examinees might be better able to guess
the intent of predominantly White item writers, so might they be better
able to make use of inadvertently provided cues that might signal which
of the options presented is the intended key. The core for these items
was the item stem. In one version, changes were made deliberately to
point to the key in ways that would normally be avoided by item
writers. These changes were absent in the alternate version.

Hypothesis 5.1 concerned verbal items, and the cues provided were
of the type that test developers know about and that have been
discussed in the literature. Hypothesis 5.2 concerned quantitative
items and specifically evaluated an option elimination strategy of a
type identified by Smith (1982) for verbal items and generalized to
quantitative items by Kuntz (1982). These hypotheses are illustrated
in Example 9.

Example 9 

Hypothesis 5. 1 

Khrushchev's gift to history is, and always
was, himself. Khrushchev's greatest
qualities, those that distinguished him from
all other Soviet leaders, were his energy,
his enthusiasm, his confidence in himself 
and in others. It was his prodigal 
personality, his ability to confess a 
mistake and reverse himself, his explosive 
unpredictability that did more than anything 
else to spring the genie of spontaneity out 
of the bottle of repression in which Stalin 
had contained the Russian spirit for thirty 
years. 

Version A                            Version B*

According to the passage, all        According to the passage, all
of the following describe            of the following describe
Khrushchev's personality EXCEPT      Khrushchev's personality EXCEPT

(A) energetic                        (A) energetic
(B) unpredictable                    (B) spontaneous
(C) crafty                           (C) enthusiastic
(D) clever                           (D) clever
*(E) inflexible                      *(E) inflexible

*Notice that in this item version the key is the only one of the
options with a negative tone.







Hypothesis 5.2 



If x = 3z + 9 and z = y + 5, what is x in terms of y?



Version A

(A) y + 14
(B) 3y - 6
(C) 3y + 4
(D) 3y + 14
*(E) 3y + 24

Version B

(A) y + 24
(B) 3y - 6
(C) 3y + 4
(D) 3y + 14
*(E) 3y + 24



Hypothesis 6.0. Effective test-taking strategies will positively
influence performance. One hypothesis is that Black examinees have
less well developed test-taking skills. One possible outcome of this
is that the position of the key will affect their performance
differently than that of White examinees.

Hypothesis 6.1 used 10 verbal analogy pairs. In each item pair,
the stimulus words and the options were the same. The two versions
differed only in the option order. Similarly, for Hypothesis 6.2 the
problem posed was the same for the six quantitative item pairs.
Because options were placed in order of ascending (or descending)
magnitude, key placement was influenced by replacing an option of
magnitude less than the key with one larger (or vice versa). Item
pairs illustrating these hypotheses are given in Example 10. The
analogy item is a disclosed item actually used in this study.



Example 10 



Hypothesis 6.1 



Version A

Parchment : Paper ::

(A) ink : quill
(B) radiator : thermostat
(C) embroidery : broadloom
(D) citrus : juice
*(E) clavichord : piano

Version B

Parchment : Paper ::

*(A) clavichord : piano
(B) radiator : thermostat
(C) embroidery : broadloom
(D) citrus : juice
(E) ink : quill






Hypothesis 6.2

If 15 ≤ x ≤ 25 and y - x = 3, what is the greatest possible value of
x + y?

Version A

(A) 60
(B) 56
*(C) 53
(D) 47
(E) 28

Version B

(A) 56
*(B) 53
(C) 47
(D) 40
(E) 28

Hypothesis 7.0. Ambiguity in an item increases as the content of
an item becomes more removed from the experience of the examinee.
Again it is hypothesized that Black examinees are more apt to be misled
by ambiguous item content. Consequently, they would be expected to do
relatively better where the content is less abstract and more concrete.
This general hypothesis was evaluated with two specific quantitative
hypotheses with six item pairs each.

Hypothesis 7.1 concerned the use of algebraic symbols or variables
such as x and y rather than numbers. The problem to be solved was the
same in both versions, but the versions differed in whether this
problem was posed using numbers or letters. Example 11 gives an
illustration.



Example 11

Hypothesis 7.1

Version A

A ladder that is 10 feet long is leaning against a wall. The base
of the ladder is on the floor 6 feet from the wall. What is the
distance in feet from the bottom of the wall to the point where
the ladder touches the wall?

(A) 6
*(B) 8
(C) 9
(D) 10
(E) 2√34

Version B

A ladder that is x feet long is leaning against a wall. The base
of the ladder is on the floor y feet from the wall. What is the
distance in feet from the bottom of the wall to the point where
the ladder touches the wall?

[Answer options (A) through (E), expressions in x and y, are not
legible in the source copy.]






Hypothesis 7.2 concerned the use of figures or diagrams to
illustrate the problem posed. Again the problem to be solved was the
same for the item pair. The version depended on whether a diagram or
verbal description was provided in illustration. See Example 12.

Example 12 

Hypothesis 7.2 



Version A

[Figure: a square and a rectangle, labeled "Square" and "Rectangle";
a dimension of 16 is marked.]

In the figure above, if the area of the square is equal to the area of
the rectangle, then x =

*(A) 8    (B) 7    (C) 6    (D) 5    (E) 4

Version B

If a square with each side x units in length has the same area as a
rectangle 4 units by 16 units, then x =

*(A) 8    (B) 7    (C) 6    (D) 5    (E) 4



Data Analysis 

Each hypothesis of a differential effect on the performance of
Black and White examinees due to an item manipulation was tested using
log linear procedures. The models chosen were "logit models," which
are analogous to multiple regression procedures in which the variables
are all categorical. The item response (correct or incorrect) was
treated as the dependent variable, while the item pair, the item
version, and group membership were the independent variables (or
predictors). In these analyses, the effects of most interest for the
purposes of this study were the interaction between group and version
and the three-way interaction of group, version, and item pair. Both
of these interactions indicate that the effect of the item manipulation
was different for Blacks and Whites. The three-way interaction
indicates that this difference in effect varied among item pairs.
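
As an illustration of what the group-by-version interaction measures, the following minimal sketch computes it for a single item pair. It relies on the fact that, in a saturated logit model, the interaction term equals the difference between the two groups' version effects on the log-odds scale. The counts and the function names are hypothetical and are not taken from the study.

```python
import math

def logit(n_right, n_wrong):
    """Log odds of a correct response for one group-by-version cell."""
    return math.log(n_right / n_wrong)

def group_by_version_interaction(counts):
    """Difference between the White and Black version effects (log-odds scale).

    counts maps (group, version) to (n_right, n_wrong); all counts must be > 0.
    A value of zero means the manipulation shifted both groups equally.
    """
    effect = {}
    for group in ("Black", "White"):
        effect[group] = logit(*counts[(group, "B")]) - logit(*counts[(group, "A")])
    return effect["White"] - effect["Black"]

# Hypothetical counts, for illustration only.
example_counts = {
    ("Black", "A"): (50, 50), ("Black", "B"): (100, 50),
    ("White", "A"): (100, 100), ("White", "B"): (200, 100),
}
interaction = group_by_version_interaction(example_counts)
```

In this made-up table the odds of a correct response double from Version A to Version B in both groups, so the interaction is zero even though the groups differ in sample size.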

In addition to the log linear analyses of the effects of the
independent variables on performance, item bias analyses were conducted
that also took into account differences in item discrimination and the
relative abilities of the two groups. These two sets of analyses
differed in their perspective on the problem and the questions
addressed by each. Where the focus of the log linear analyses was on
the differences in difficulty between the two item versions within each
group, the focus of the item bias analyses was on the differences
between the two groups within an item version.






The item bias analyses used methods based on item response theory
models. For the purposes of these analyses, an unbiased item was
defined as one for which the probability of a correct response is the
same for persons of a given level of the ability measured by the test
regardless of their group membership. In the terms of item response
theory, this definition may be stated: the item characteristic curves
of an unbiased item must be the same for two groups of interest (Lord,
1977; Scheuneman, 1980). In practice this means that the three
parameters that serve to define these curves are the same for both
Blacks and Whites. The method chosen to estimate these parameters was
developed by Thissen (1982a) and is appropriate for this application
where one sample is relatively small and the other much larger.

Both the log linear analyses and the bias analyses are based on 
successive fittings of a mathematical model to the data. A given model 
is first applied to the data and a statistic is computed that reflects 
the fit of that model. When an assumption is relaxed or an element 
added or deleted from the model, a new value for the statistic is 
obtained. The difference between successive statistics represents the 
change in the overall fit of the model to the data. The value of this
difference gives a test for the significance of the interactions of
interest in the log linear analysis or of the assumption that a given 
item parameter is equal for the two groups in the bias analysis. The 
details of these procedures are given in Appendixes A and B, 
respectively. 
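
The successive-fitting logic can be sketched in a few lines. The example below assumes that the fit statistic is a chi square and that the difference between two nested models is referred to a chi-square distribution with degrees of freedom equal to the difference in their degrees of freedom; the fit values used are hypothetical, and the helper names are illustrative only.

```python
import math

def chi_square_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-z) * sum of z**i / i! for i below df/2.
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= z / i
            total += term
        return math.exp(-z) * total
    # Odd df: start from the df = 1 case and add the recurrence terms.
    total = math.erfc(math.sqrt(z))
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-z) * z ** (i - 0.5) / math.gamma(i + 0.5)
    return total

def difference_test(fit_restricted, df_restricted, fit_fuller, df_fuller):
    """Change in fit between two nested models and its p value."""
    delta = fit_restricted - fit_fuller
    delta_df = df_restricted - df_fuller
    return delta, delta_df, chi_square_sf(delta, delta_df)

# Hypothetical fit statistics: adding one term improves fit by 12.76 on 1 df.
delta, delta_df, p_value = difference_test(30.00, 10, 17.24, 9)
```

A small p value indicates that the added term (for example, a group-by-version interaction) is needed for the model to fit the data.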

Results 

The equivalence of the samples for a given pair of tests was 
evaluated first by scoring the operational sections of the content area 
corresponding to the experimental sections. The results are shown in 
Table 5. Both means and standard deviations were very similar for the 
two samples within a group; hence they were treated as equivalent in 
subsequent analyses. 

Log Linear Analyses 

The results for the relevant interaction effects are shown in 
Table 6. The chi square value shown for the group-by-version
interaction is the difference between the chi square of fit for the 
three main effects (item pair, item version, and group) and the chi 
square of fit for these effects plus the group-by-version effect. The 
three-way interaction is evaluated by determining the chi square 






Table 5

Performance of Samples on Operational Tests (a, b)

                      Blacks                Whites
                  Test 1   Test 2       Test 1   Test 2

Verbal
  Mean             32.0     31.4         46.3     46.0
  SD               11.3     11.3         10.8     11.0
  N                 300      280         1940     1920

Quantitative
  Mean             27.1     26.5         38.6     38.5
  SD               10.4      9.7          9.7      9.7
  N                 315      320         1900     1935

Analytical
  Mean             19.1     19.0         27.3     27.4
  SD                6.7      6.5          7.1      7.2
  N                 285      300         1895     1925

a Mean scores are given for the operational sections in the same content
area as the experimental section.

b Ns will generally be different from those shown in Tables 1 and 2. In
the case of Black examinees, some were included in the analysis who
did not provide sex or major subject on their registration form. In
addition, the computer program from which these data were taken rounds
down to an even multiple of 5.






Table 6

Chi Square Values for Interactions of Group and Version
Log Linear Analyses

                  Group by Version          Group by Version by Item
Hypothesis        χ²      df    p           χ²      df    p

1.1               [?]      1    ***         [?]     [?]   [?]
1.2                .50     1                 2.68     5
1.3                .11     1                 3.44     4
2.1               [?]      1    *           [?]     [?]   [?]
2.2              12.76     1    ***          8.00     4   *
3.1                .50     1                14.76     4   ***
3.2               8.26     1    ***          6.81     4
4.1               1.23     1                 3.81     4
4.2               3.27     1    *            1.70     4
4.3               1.14     1                 3.85     4
5.1               8.65     1    ***         13.97     4   ***
5.2                .90     1                  .89     5
6.1                .66     1                 8.31     9
6.2               2.50     1                12.36     5   **
7.1               2.80     1    *           27.27     5   ***
7.2                .94     1                11.37     5   **

* P < .10
** P < .05
*** P < .01

[?] Entry not legible in the source copy.

a Version effects were not significant for these hypotheses. That is,
when the two groups were pooled, the two item versions did not differ
in difficulty.






for all effects except the three-way interaction (group-by-version-
by-item pair). If the chi square is significant, it indicates that the
three-way interaction must be taken into account if the model is to fit
the data. Results for all effects are given in Appendix A.

Four of the hypotheses had highly significant group-by-version
interactions: Hypothesis 1.1, antonyms versus sentence completion;
2.2, the effects of adding (or modifying) prefixes/suffixes to the
stimulus word in antonym items; 3.2, one true versus one false response
in analytical items; and 5.1, test wiseness cues in verbal items.
Marginal group-by-version interaction effects were noted with three
other hypotheses: 2.1, substitution of a more difficult word for the
correct response; 4.2, presence or absence of the word most or best;
and 7.1, the use of numbers versus symbols in quantitative items. Of
these, Hypotheses 1.1, 2.1, 2.2, 5.1, and 7.1 also had three-way
interactions. Three other hypotheses had significant three-way
interactions but nonsignificant group-by-version interactions. These
were Hypothesis 3.1, one true versus one false response for verbal
items; 6.2, the position of the key in quantitative items; and 7.2,
diagrams versus verbal descriptions in quantitative items.

In general, the three-way interaction indicates that the 
group-by-version effect is different for the different items making up 
a hypothesis. In order to understand the nature of this interaction 
more clearly, separate analyses were performed for each item in the
eight hypotheses that had significant three-way interactions. In the 
analysis of an individual item, the contingency table was a 
two-by-two-by-two table with group, version, and response (right/ 
wrong). The chi square value was the fit statistic for a model with 
the two main effects only. If this value was significant, it indicated 
that the main effects alone did not fit the data and hence the 
interaction was required. 
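
The single-item analysis can be sketched as follows. The report does not state which fitting algorithm or which form of the chi square was used, so this example assumes iterative proportional fitting and the likelihood-ratio chi square; the counts and all names are hypothetical. The model fitted is the one in which response depends on group and on version but not on their combination, which leaves one degree of freedom for the interaction.

```python
import math

IDX = (0, 1)

def fit_main_effects(n, iterations=500):
    """Fitted counts for the model with no group-by-version effect on response.

    n[g][v][r] holds observed counts indexed by group, version, and response
    (0 = right, 1 = wrong); every cell must be greater than zero.
    """
    m = [[[1.0 for _ in IDX] for _ in IDX] for _ in IDX]
    for _ in range(iterations):
        for g in IDX:                      # match group-by-version totals
            for v in IDX:
                scale = sum(n[g][v]) / sum(m[g][v])
                for r in IDX:
                    m[g][v][r] *= scale
        for g in IDX:                      # match group-by-response totals
            for r in IDX:
                scale = sum(n[g][v][r] for v in IDX) / sum(m[g][v][r] for v in IDX)
                for v in IDX:
                    m[g][v][r] *= scale
        for v in IDX:                      # match version-by-response totals
            for r in IDX:
                scale = sum(n[g][v][r] for g in IDX) / sum(m[g][v][r] for g in IDX)
                for g in IDX:
                    m[g][v][r] *= scale
    return m

def likelihood_ratio_chi_square(n, m):
    """Fit statistic comparing observed counts n with fitted counts m."""
    return 2.0 * sum(
        n[g][v][r] * math.log(n[g][v][r] / m[g][v][r])
        for g in IDX for v in IDX for r in IDX
    )

# Hypothetical tables: [group][version] -> [right, wrong].
same_effect = [[[50, 50], [100, 50]], [[100, 100], [200, 100]]]
different_effect = [[[50, 50], [50, 50]], [[100, 100], [200, 100]]]
fit_same = likelihood_ratio_chi_square(same_effect, fit_main_effects(same_effect))
fit_diff = likelihood_ratio_chi_square(different_effect, fit_main_effects(different_effect))
```

When the version effect is the same in both groups the main-effects model reproduces the table and the statistic is essentially zero; when the effect differs between groups the statistic grows, signalling that the interaction is required.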

Table 7 provides the results for those items where the probability 
of the obtained chi square was less than .20. Higher levels of 
significance are indicated with asterisks. In order to achieve a
clearer understanding of the different effects, however, further
information is required. For this purpose, Table 7 also provides "odds
ratios" and an indicator of which group or version had the larger
effect. The odds ratio is the ratio of right to wrong responses for
Version B divided by the ratio of right to wrong responses for Version
A for each of the two groups. In both instances, if the effect of the
manipulation is zero the odds ratio will equal one. A ratio greater
than one indicates that Version B is easier than Version A; a ratio
less than one indicates the reverse. Once the value of the odds ratio
departs from one, however, the ratios for the two groups are no longer
strictly comparable since the magnitude of the ratio is related to the
relative difficulty. Hence, the indicator of the larger effect given 
in the table is based on item difficulties (percent of correct 






Table 7

Analyses for Individual Items in Hypotheses with Significant
Three-Way Interactions

                                       Odds Ratio        Largest Effect
Hypothesis   Item     χ²      p      Black    White     Group   Version

1.1             1     1.70             .96     1.24      Wh       B
                2     2.09           11.37    15.80      Wh       B
              [?]     [?]    [?]      [?]      [?]       Wh      [?]
                5     [?]    [?]      [?]      [?]       Bl      [?]
2.1             7    32.13   ***      9.32    34.97      Wh       B
              [?]     [?]    [?]      [?]      [?]       Bl       A
              [?]     [?]    [?]      [?]      [?]       Bl       A
              [?]     [?]    [?]      [?]      [?]      [?]      [?]
2.2            13     3.80   **       1.21     1.72      Wh       B
               14     [?]    [?]      [?]      [?]       Wh      [?]
               15     [?]    [?]      [?]      [?]       Bl      [?]
               17     2.04             .88     [?]                B
3.1            18     [?]    **       [?]      [?]       Wh      [?]
              [?]     [?]    [?]      [?]      [?]       Wh       A
              [?]     [?]    **       [?]      [?]       Bl      [?]
5.1            23     1.66             .61      .58
               25     4.14   **        .77     1.19               B
               29    12.54   ***       .69     1.32      Bl       B
6.2            13     3.36   *         .88     1.21               B
               21     4.50   **        .66      .98      Bl       B
               25     1.62            1.20      .96      Bl       A
               29     6.30   ***       .54      .86      Bl       B
7.1             1     3.60   *         .69      .49      Wh       A
               17    23.78   ***       .41      .16      Wh       A
               27     3.90   **        .40      .56      Bl       B
7.2             3     2.31             .69      .91      Bl       B
                6     3.32   *         .48      .66      Bl       B
               15     2.23             .95      .73      Wh       A
               23     2.89   *         .65      .44      Wh       A

* P < .10
** P < .05
*** P < .01

[?] Entry not legible in the source copy.






responses). This indicator is often, but not always, in agreement with 
the relative magnitude of the ratios for Blacks and Whites. 
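
The odds ratio defined above can be computed directly from the right and wrong counts for the two versions within one group. The sketch below uses hypothetical counts and an illustrative function name.

```python
def odds_ratio(right_a, wrong_a, right_b, wrong_b):
    """Odds of a correct response on Version B divided by the odds on Version A."""
    return (right_b / wrong_b) / (right_a / wrong_a)

# Hypothetical counts for one group, for illustration only.
no_effect = odds_ratio(right_a=60, wrong_a=40, right_b=120, wrong_b=80)
b_easier = odds_ratio(right_a=50, wrong_a=50, right_b=75, wrong_b=25)
```

A value of one means the manipulation had no effect for that group; a value above one means Version B was easier, and a value below one means Version A was easier.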

The three-way interactions may have resulted from a lack of
consistency in three possible dimensions. First is the relative
difficulty of the two versions. If one version is always more
difficult than the other, the odds ratios will all be greater than one
or less than one. For example, in Hypothesis 2.1, Version B was always
easier than Version A, while in Hypotheses 7.1 and 7.2, Version A was
always easier than Version B. Second is a consistency in the group for
which the effect is larger. Hypothesis 1.1 showed larger differences
for Whites in three of four cases while Hypothesis 6.2 similarly showed
larger differences for Blacks. Third is the version that showed the
greater group differences. For example, both Hypotheses 1.1 and 2.2
consistently showed larger effects in Version B.

A brief summary of the log linear results is as follows: 

Hypothesis 1.1. A significant group-by-version interaction was
found. The effect was larger for Whites and the difference between
groups larger for Version B (sentence completion). The three-way
interaction appears to be due primarily to item 5 (ingenuity), which
showed larger differences for Blacks.

Hypothesis 1.2. This hypothesis showed no significant
interactions but also showed no version effect. That is, the item
manipulation resulted in little change in difficulty for either group.

Hypothesis 1.3. No interactions of interest were found here.

Hypothesis 2.1. The group-by-version effect for this hypothesis
was marginal, perhaps due to the off-setting effects of the items
contributing to the three-way interaction. Although Version B (more
common meaning) was consistently easier, two items (grave and curb)
showed larger effects for Blacks and Version A and two (conventional
and dampen) for Whites and Version B.

Hypothesis 2.2. This hypothesis showed a strong group-by-version
interaction with larger differences between groups for Version B (more
common meaning). The three-way interaction appears to have resulted
from an inconsistency in the relative difficulty of the two item
versions for Blacks. Items 13 and 14 (unimpeachable and ungainly) both
showed Version B easier than Version A and larger effects for Whites.
Items 15 and 17 (impalpable and guileless), however, showed Version A
(less common meaning) easier than Version B for Blacks. Item 15 also
showed a larger effect size for Blacks, though for item 17, effects
were nearly equal for Blacks and Whites, but in opposite directions.

Hypothesis 3.1. This hypothesis did not show the expected
group-by-version interaction. Again, however, the patterns of the
items contributing to the three-way interaction suggest that the






effects were in different directions, canceling each other out when all
items were combined. All three items with significant effects showed
different patterns. Item 18 showed larger effects for Whites and
Version B (one true response), item 26 for Whites and Version A, and
item 32 for Blacks and Version B.

Hypothesis 3.2. This hypothesis showed a significant group-by-
version interaction with Version A (one true response) showing the
larger difference. The effect size was not consistently larger for
Blacks or Whites, however, and Version A was consistently easier only
for Whites.

Hypothesis 4.1. This hypothesis showed no interactions of
interest.

Hypothesis 4.2. This hypothesis showed a marginal group-by-
version interaction that is particularly interesting since the overall
version effect is not significant. This seems to have occurred since
the manipulation had virtually no effect on Whites, much the larger
sample. The effect for Blacks was small, but sufficient to produce the
small interaction. Version A ("best" or "most" absent) was easier than
Version B.

Hypothesis 4.3. No interactions of interest were found for this
hypothesis.

Hypothesis 5.1. Both a group-by-version interaction and a three-
way interaction were found for this hypothesis. Version A (cues
absent) was easier for Blacks, but Version B was more often easier for
Whites and had the larger effects. For items 25 and 29, the effect
sizes for Blacks and Whites were nearly equal and in opposite
directions.

Hypothesis 5.2. No interactions of interest were found for this
hypothesis.

Hypothesis 6.1. This hypothesis showed no interactions of
interest.

Hypothesis 6.2. The group-by-version interaction was not
significant although the probability of the obtained chi square was
less than .20. The three-way interaction was significant, however.
Generally, the effect was larger for Blacks and differences were larger
for Version B (key is C). The exceptions, however, were probably the
more important contributors to the three-way interaction and may have
been sufficient to cancel out some of the overall effect. While
Version A tends to be easier for both groups, Version B is easier for
one item for each group, but not the same item. One item showed a
larger group difference for Version A.

Hypothesis 7.1. This hypothesis had a marginal group-by-version
interaction but a highly significant three-way interaction. Since
Version A (numbers) is consistently easier, the interaction appears to






be due to the change in both the version and group showing the larger
effect. Items 1 and 17 had larger differences for Whites and Version
A, while item 27 showed a larger difference for Blacks and for Version B.

Hypothesis 7.2. This hypothesis shows only a three-way
interaction. Of the four items with relatively large effects, two show
results nearly opposite and similar in magnitude, essentially canceling
out the group-by-version interaction. Although Version A (diagrams) is
uniformly easier, items 3 and 6 show larger effects for Blacks and for
Version B. The other two items show larger effects for Whites and for
Version A.



Analysis by Reading Passage 

The complexity of the interactions for Hypotheses 3.1 and 5.1 is
particularly interesting if one realizes that the items for these
hypotheses, as well as those for Hypothesis 4.1, all relate to the same
three reading passages. The content of these reading passages may be
one source of the differences in effect of the item manipulations. In
order to evaluate this possibility, the five items for a given passage
were analyzed separately without consideration for the particular
hypothesis that dictated the two item versions. The results are shown
in Table 8.

For Passage 2, the group-by-version interaction was not significant,
although the three-way interaction was. Examination of the results for
the five items related to this passage does not reveal any particular
consistency likely to be related to an independent effect of the passage.
The three-way interaction appears to be the result of different effects
for the different item manipulations associated with the three
hypotheses.

Passage 1, on the other hand, shows a strong group-by-version
interaction but no three-way interaction. The strength of the
interaction effect is particularly interesting since three of the five
items for this passage are associated with the nonsignificant Hypothesis
4.1. In contrast with the results for the other passages, the effect of
most interest is the larger effect for Whites in all items, although
Version B also consistently was easier and had the larger effect. The
similarity of patterns for all five items may be due in part to
influences of the passage content.

Passage 3 has both a small group-by-version effect and a three-way
interaction. The interaction appears to be due to differences associated
with item version and hence was probably related to the item
manipulations associated with the different hypotheses. The effect sizes
were, however, consistently larger for Blacks. This is particularly
striking since the verbal items generally showed larger effects for
Whites. Thus, this consistency may again be associated with the passage
content.

Considering these results in light of the complex three-way
interactions for Hypotheses 3.1 and 5.1, passage content does appear to






Table 8

Results by Reading Passage

          Group by Version     Group by Version by Item                  Odds Ratio      Larger Effect
Passage   χ²     df   p        χ²      df   p        Item   Hyp.     Black   White     Group   Version

1         6.57    1   ***      [?]      4              18    3.1      1.20    1.88      Wh       B
                                                       19    3.1      1.89    2.05
                                                       20    4.1       .99    1.17      Wh       B
                                                       21    4.1      1.03    1.36      Wh       B
                                                       22    4.1      1.42    1.55      Wh       B
2          .27    1            11.38    4   [?]        23    5.1       .61     .47
                                                       24    4.1      1.40    1.76      Wh       B
                                                       25    5.1       .77    1.19               B
                                                       26    3.1       .73     .51      Wh       A
                                                       27    5.1       .86     .85
3         3.57    1   *        15.14    4   ***        28    4.1      1.16     .98      Bl       A
                                                       29    5.1       .69    1.32      Bl       B
                                                       30    5.1       .86    1.07      Bl       B
                                                       31    3.1      1.63    1.38      Bl       B
                                                       32    3.1       .45     .71      Bl       B

* P < .10
** P < .05
*** P < .01

[?] Entry not legible in the source copy.






have affected the results for Hypothesis 3.1 to the extent to which that
interaction reflected the switch of the larger effect from one group to
the other. This hypothesis had two items each related to Passages 1 and
3 that showed opposite effects. The results for Hypothesis 5.1, however,
do not appear to be clearly linked with passage content.



Item Bias Analyses 

Item bias analyses give even more detail about the results for
individual items than the log linear analyses. This information only
pertains to comparisons of the two groups within an item version,
however; hence, it must be considered a supplementary analysis. These
analyses go beyond the log linear analyses presented here in that item
properties other than difficulty are considered and the abilities of
the two groups are taken into account.

The method chosen assumes that an unbiased item functions in the
same way for all examinees, regardless of group membership. Hence, the
probability of a correct response at a given level of ability can be
determined without reference to any particular sample of examinees.
The probability of a correct response over a range of abilities is
represented by the item characteristic curve. If an item is unbiased,
this curve will be the same for both groups of examinees. Equivalence
of the curve is established if each of the three parameters of the
mathematical function defining the curve is the same. These three
parameters correspond roughly to the item difficulty, discrimination,
and probability of correctly guessing at low levels of ability. The
method used here finds parameters to be equal if the fit of the model
to the data is not significantly different when the parameters are
assumed to be equal from when they are permitted to vary. More detail
concerning item response theory and the method used is given in
Appendix B.
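For orientation, the item characteristic curve described above is
commonly written in the three-parameter logistic form shown below. This
is the standard textbook expression and is offered as an assumption;
the exact parameterization used in Appendix B, including whether the
scaling constant D is set to 1.7 or to 1, is not stated in this
section.

```latex
P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp\left[-D\,a_i\,(\theta - b_i)\right]}
```

Here \theta is examinee ability, a_i the discrimination, b_i the
difficulty, and c_i the lower asymptote (the probability of a correct
response by guessing at low ability) for item i. An item is unbiased in
this sense when a_i, b_i, and c_i are the same for both groups.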

Because of the time and expense involved in using item response
theory methods, however, bias analyses were performed only for those
hypotheses where results seemed most likely from preliminary analyses
of the data. The analyses were performed for verbal Hypotheses 1.1,
2.2, 3.1, and 5.1; quantitative Hypotheses 6.2, 7.1, and 7.2; and
analytical Hypothesis 3.2. Details of the results are provided in
Appendix B.

One might speculate about a finding of bias in relation to the log
linear results for particular items. Items biased solely in difficulty
should also show group-by-version interactions in the log linear
analyses of individual items, with the version in which the bias
occurs showing larger effects if the bias favors Whites and smaller
effects if the bias favors Blacks. Likewise, where an item is
sufficiently difficult to reject a hypothesis of equal probabilities of
correctly guessing, the difference should be reflected in the
log linear analyses. If bias exists in discrimination, however, it
might be expected to manifest itself differently, depending on the
ability distributions of the two groups. In instances where the group
favored is different depending on the ability range, nearly any result,
including no effect, might be expected from the log linear analyses.
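The point about discrimination bias can be illustrated with the
parameter values reported in Table 9 for item 18, Version B, where the
two groups share b and c but differ in a. The sketch below assumes the
standard three-parameter logistic curve with a scaling constant of
1.7; both the functional form and the constant are assumptions, since
the report's own specification is in Appendix B.

```python
import math

D = 1.7  # conventional scaling constant; an assumption, not from the report

def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Item 18, Version B, as read from Table 9: same b and c, different a.
whites = {"a": 0.8, "b": -0.3, "c": 0.14}
blacks = {"a": 2.0, "b": -0.3, "c": 0.14}

for theta in (-1.3, -0.3, 0.7):
    pw = p_correct(theta, **whites)
    pb = p_correct(theta, **blacks)
    print(f"theta={theta:+.1f}  Whites={pw:.3f}  Blacks={pb:.3f}")
```

Under these assumptions the two curves cross at theta = b: below that
point the flatter curve (Whites) gives the higher probability, and
above it the steeper curve (Blacks) does. Which group appears favored
in aggregate therefore depends on where each group's ability
distribution lies.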

The items detected in the bias analysis are also likely to be
outliers, items functioning somewhat differently from the others, and
hence likely to be among those contributing to a three-way interaction.
Conversely, a hypothesis without a three-way interaction is unlikely to
contain a biased item unless the bias is unassociated with the
manipulation. In this case the effect of the bias would be about the
same for both items of a pair. The bias would then contribute to the
item-by-group interaction, but not the three-way interaction.
Unfortunately, the preliminary analyses were focused primarily on the
effects of the group-by-version interaction, so that Hypothesis 3.2,
which does not have a three-way interaction, was selected for these
analyses while Hypothesis 2.1, which has a large three-way interaction,
was missed.



The results for items where bias was detected are shown in Tables
9 and 10. The three parameters are indicated, where a reflects item
discrimination, b reflects item difficulty, and c is the probability of
guessing correctly. A more discriminating item has a higher a value,
and, as difficulty increases, the value of b increases (items with
negative values are easier than items with positive values). Bias is
indicated when the values of a given parameter are not the same for
Blacks and Whites. Notice that while some of the differences in
parameter values seem small, these values provided a significantly
better fit to the data than when the values were held equal.
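One common way to decide whether freeing a parameter gives a
"significantly better fit" is a likelihood-ratio comparison of the
constrained and unconstrained models. Whether Appendix B uses exactly
this statistic is not stated here, so the sketch below is an
illustration under that assumption; the function name and the
log-likelihood values are hypothetical.

```python
# Chi-square critical values at the .05 level for 1 to 3 degrees of freedom.
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815}

def parameters_differ(loglik_free, loglik_equal, n_constrained):
    """Likelihood-ratio comparison of two nested models.

    loglik_free   : log-likelihood with parameters free to vary by group
    loglik_equal  : log-likelihood with parameters held equal across groups
    n_constrained : number of parameters held equal (degrees of freedom)
    Returns the statistic and whether it exceeds the .05 critical value.
    """
    g2 = 2.0 * (loglik_free - loglik_equal)
    return g2, g2 > CHI2_CRIT_05[n_constrained]

# Hypothetical log-likelihoods for one item with one parameter constrained.
g2, biased = parameters_differ(-1520.4, -1523.9, n_constrained=1)
print(round(g2, 2), biased)
```

Because the decision rests on the change in fit rather than on the
size of the parameter difference, a numerically small difference can
still be flagged when the samples are large.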

A brief summary of the results and a comparison with the log 
linear results are presented for each hypothesis as follows: 

Hypothesis 1.1. This hypothesis showed one biased item, item 5
Version B (sentence completion). This agrees with the log linear
analyses, where item 5 was the item that appeared to account for the
three-way interaction for this hypothesis. This was the only item
where the effects were larger for Blacks.

Hypothesis 2.2. The analyses for this hypothesis yielded two
biased items, item 15 Version A (less common meaning) and item 17, both
versions. Item 15 also showed a particularly large effect in the
log linear analyses. Again, this was the only item of the set where
the effect was larger for Blacks, although it was also larger for the
unbiased Version B. Item 17 showed only a small effect. This may be
because the two item versions appeared to be biased in very similar
ways, both in item difficulty and discrimination, suggesting the bias
may not have been related to the item manipulation. Both items,
however, appeared to be contributing to the three-way interaction for
this hypothesis, as the "more common" word was actually more difficult
for Blacks in both instances.


Table 9

Summary of Items
Identified by Bias Procedures

Verbal Hypotheses

                                        Parameter Value
Hypothesis   Item   Version   Group      a       b       c
   1.1         5       B      Whites     --     -.5     .2
                              Blacks     .3      .5     .2
   2.2        15       A      Whites     .5      .7     .25
                              Blacks     .5      .7     .15
   2.2        17       A      Whites    1.2     1.0     .25
                              Blacks    1.0      .3     .15
                       B      Whites    1.2     1.0     .20
                              Blacks    1.0      .4     .20
   3.1        18       A      Whites     .4      .4     .14
                              Blacks     .5      .9     .14
                       B      Whites     .8     -.3     .14
                              Blacks    2.0     -.3     .14
   5.1        25       A      Whites    1.3     2.0     .22
                              Blacks    1.3     2.0     .23
                       B      Whites     .4     2.9     .19
                              Blacks     .4     2.9     .15

-- = value not legible.


Table 10

Summary of Items
Identified by Bias Procedures

Quantitative Hypotheses

                                        Parameter Value
Hypothesis   Item   Version   Group      a       b       c
   6.2        13       B      Whites    1.5      .5     .63
                              Blacks     .9     -.5     .26
   6.2        28       A      Whites    1.2      .02    .12
                              Blacks    1.4      .5     .12
                       B      Whites     .9      .1     .05
                              Blacks    1.5      .3     .06
   7.1        17       A      Whites    1.2     -.8     .27
                              Blacks    1.0     -.8     .08
                       B      Whites    1.5      .7     .14
                              Blacks    1.5      .7     .16
   7.1        19       A      Whites     .6    -2.3     .33
                              Blacks    1.3    -2.0     .37
                       B      Whites     .5     -.8     .14
                              Blacks     .7     -.3     .16
   7.1        27       A      Whites     .8      --      --
                              Blacks     .8     -.3     .37
   7.2         2       B      Whites     .9     -.5     .01
                              Blacks     .9     -.5     .14
   7.2         3       B      Whites     .4     -.4     .01
                              Blacks     .65     .25    .14
   7.2         6       B      Whites     .7     -.6     .01
                              Blacks    1.6     -.3     .14
   7.2        15       A      Whites    1.8      .5     .40
                              Blacks     .9     -.1     .16
   7.2        23       A      Whites    1.0     -.3     .02
                              Blacks    1.7      .2     .07

-- = value not legible.

Hypothesis 3.1. This hypothesis showed only one biased item, item
18, both versions. In this instance, however, Version A (one false
answer) was biased in difficulty and, to a very small extent, in
discrimination. Version B showed large differences only in
discrimination. Perhaps this bias was yet another contributor to the
complex three-way interaction for this hypothesis.

Hypothesis 3.2. This hypothesis was not found to have any biased
items, not surprisingly in view of the nonsignificant three-way
interaction in the log linear analyses.

Hypothesis 5.1. Bias was found for this hypothesis only when an
assumption that the probability of correctly guessing by lower ability
examinees was the same for both groups was found not to fit the data
for either version. Since item 25 was the only item of this set
difficult enough for sufficient data to be available to reject this
hypothesis, this was the item assumed to be biased. This result may
not be very meaningful, however, as the probability of a correct
response by guessing (the value of c) actually equals the proportion of
correct responses for Blacks on Version A and is only slightly lower
than the proportion of correct responses for Version B. This suggests
that the item was so difficult for Blacks that the responses were
nearly random.

Hypothesis 6.2. For this hypothesis, two items were found to be
biased, item 13 Version B and item 28, both versions. The result for
item 13 showed an unusually large bias in the probability of correctly
guessing that favors White examinees. This is consistent with the log
linear analyses, where item 13 is the only item to show Version B (key
not "C") easier than Version A for Whites and the only item where the
effect for Whites is as large as that for Blacks. Item 28 did not show
an effect in the log linear analyses. Again, that result may be
associated with aspects of the item that were the same in both versions
rather than with the item manipulation.

Hypothesis 7.1. Three items were biased for this hypothesis:
items 17 and 19, both versions, and item 27, Version A. The results
for item 17 were different for the two versions, however, with only a
small difference in the probability of guessing correctly for Version B
(symbols). Version A (numbers) showed a small difference in
discrimination but a fairly large difference in the probability of
guessing correctly. This version might therefore be expected to show a
larger effect for Whites, the group favored, as it did in the
log linear analyses. Item 27 shows a fairly small difference favoring
Blacks in the probability of guessing correctly. The log linear
analyses show larger effects for Blacks and for Version B, as would be
expected if the bias acted to reduce differences on Version A. Item
19, however, did not show an effect in the log linear analyses, again
suggesting the bias was not associated with the item manipulation.

Hypothesis 7.2. This hypothesis showed bias in all but one of the
items. Items 2, 3, and 6 showed bias in Version B (descriptions) and
items 15 and 23 in Version A (diagrams). For items 3, 6, 15, and 23,
the bias is in all three parameters, making the effects somewhat
difficult to predict, although in fact the version showing the bias for
each was also the version showing the larger effect in the log linear
analyses. Thus, the two sets of analyses are fully consistent for this
hypothesis.

Discussion 

The overall hypothesis of this study was that differences in the
performance of Blacks and Whites could be demonstrated by
systematically varying particular aspects of small sets of test items.
Seven of the 16 hypotheses showed the group-by-version interaction
expected if the item manipulation suggested by each hypothesis had a
different effect for the two groups. Three other hypotheses showed a
significant three-way interaction in which effects in different
directions for different items appeared to cancel each other out so
that, when the data were combined across items, the group-by-version
interaction was not significant. The overall hypothesis that such
differences can be demonstrated would thus appear to be supported by
the results of this study.
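The cancellation described above can be made concrete with a small
numerical sketch. The counts below are hypothetical, and pooling raw
counts is a simplification of the log linear models used in the study;
the sketch only shows how two items with opposite group-by-version
effects can yield no group-by-version effect once combined.

```python
import math

def version_effect(cell_a, cell_b):
    """Log odds ratio of a correct response, Version A versus Version B.

    Each cell is (number correct, number tested).
    """
    def log_odds(correct, tested):
        return math.log(correct / (tested - correct))
    return log_odds(*cell_a) - log_odds(*cell_b)

def group_by_version(item):
    """Black version effect minus White version effect, in log-odds units."""
    return version_effect(*item["Black"]) - version_effect(*item["White"])

def pool(items):
    """Add counts across items, cell by cell."""
    pooled = {}
    for group in ("Black", "White"):
        cells = []
        for v in (0, 1):  # 0 = Version A, 1 = Version B
            correct = sum(it[group][v][0] for it in items)
            tested = sum(it[group][v][1] for it in items)
            cells.append((correct, tested))
        pooled[group] = tuple(cells)
    return pooled

# Hypothetical counts: (Version A, Version B), each as (correct, tested).
item_1 = {"Black": ((60, 100), (40, 100)), "White": ((50, 100), (50, 100))}
item_2 = {"Black": ((50, 100), (50, 100)), "White": ((60, 100), (40, 100))}

for label, item in (("item 1", item_1), ("item 2", item_2),
                    ("pooled", pool([item_1, item_2]))):
    print(label, round(group_by_version(item), 3))
```

Each item taken alone shows a sizable contrast, one positive and one
negative, while the pooled contrast is zero. A three-way
(group-by-version-by-item) term is what registers the disagreement
between the items.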

Supportive evidence for the particular hypothesized source and the
proposed rationale for the expected effect was, however, much less
clear. These issues will be discussed with regard to each hypothesis
below, but a few general observations can also be made. First, the
picture created in evaluating many of the specific hypotheses was far
more complex than had been anticipated. The elements of interest
seemed sometimes to interact with content and sometimes with other
elements of the item. The existence of three-way interactions and
numerous biased items for several of the hypotheses suggests that the
manipulated elements were not the only ones acting to produce the
observed differences between groups or between versions. These
interactions, however, should serve as a source of additional
hypotheses concerning differential effects on performance.

Another finding was that, particularly for verbal items, the
performance of Whites tended to differ more between the two item
versions than did that of Blacks. This result was contrary to
expectation and was further unlikely since sample sizes were such that,
by chance alone, changes between versions would be expected to be
larger for Blacks, for whom the standard error would be larger. On the
other hand, the theory suggests that bias may affect either group.
That is, the same difference between the observed performance of Whites
and Blacks would result regardless of whether a particular item feature
made an item easier for Whites or made the item more difficult for
Blacks. Since these hypotheses were in several instances drawn from
the results of studies where such a distinction between items favoring
Whites and those disfavoring Blacks could not be made, a mistake in
direction is not surprising. While the results of this study support
the hypothesis that such factors do affect the performance of Blacks
and Whites differently, the rationale for why such factors
differentially affect performance may need to be reexamined in some
instances and perhaps discarded. This distinction between whether the
effect occurred and why it occurred should be considered in the
following discussion of the results for the hypotheses in this study.
(In reading these summaries, it may be helpful to refer to the
descriptions of the hypotheses in Table 3.)
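The remark above about standard errors can be illustrated with the
usual formula for the standard error of a difference between two
independent proportions. The sample sizes and proportions below are
hypothetical, chosen only to show the direction of the effect; the
study's actual sample sizes are not given in this section.

```python
import math

def se_difference(p_a, n_a, p_b, n_b):
    """Standard error of the difference between two independent proportions."""
    return math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

# Hypothetical per-version sample sizes; not taken from the report.
n_smaller_group = 200
n_larger_group = 2000

se_small = se_difference(0.5, n_smaller_group, 0.5, n_smaller_group)
se_large = se_difference(0.5, n_larger_group, 0.5, n_larger_group)
print(round(se_small, 4), round(se_large, 4))
```

With the same underlying proportions, the group with the smaller
sample has the larger standard error, so larger version-to-version
changes would be expected for that group from sampling fluctuation
alone.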

Hypothesis 1.0 

Hypothesis 1.0 was concerned with item format or the structural
elements of the form in which a particular concept is to be measured.
This hypothesis was evaluated in all three areas: verbal Hypothesis
1.1, quantitative Hypothesis 1.2, and analytical Hypothesis 1.3.

Hypothesis 1.1 compared performance of Blacks and Whites on
antonyms versus sentence completions. The rationale was that, to the
extent that sentence completion items depended on vocabulary knowledge,
the context provided by the sentence would help Blacks more than
Whites. The results, however, showed larger effects for Whites and for
the sentence completion items. Moreover, despite the expectation that
sentence completion items would be easier than antonyms, this was the
case for only three of the five item pairs. An alternative hypothesis
might be that the cues provided in sentence completion items are of a
nature to provide more help to White than to Black examinees. For
example, the one biased item appeared, on examination, to require an
unstated value judgement in order for the intended response to be
correct. This suggests that the contextual cues provided by the
sentence as well as the level of vocabulary were a source of variation
in the differences between groups. This possibility deserves further
exploration.

Hypotheses 1.2 and 1.3, for which the item versions were more
similar to each other than those used with the verbal items, showed
little evidence of differences between groups. Hypothesis 1.2
concerned the quantitative comparison format used in the quantitative
sections of the GRE General Test. Although results here were negative,
the outcome for Hypothesis 7.2 discussed below suggests that this
format may interact with the type of content measured. In particular,
the four items showing an effect for Hypothesis 7.2 were all geometry
items. For Hypothesis 1.2, the item with the largest difference was
one of two geometry items in the set of six developed for this
hypothesis. This item had a pattern of results very like items 3 and 6
in Hypothesis 7.2, which are quantitative comparison items. Hence, a
different conclusion might have been reached with a different set of
items. The results from this study should therefore not be considered
conclusive evidence that this format has no differential impact on the
performance of Black and White examinees.

For Hypothesis 1.0, then, some differential impact was shown for
antonyms versus sentence completions, although apparently not for the
reasons anticipated. No evidence was found that quantitative
comparisons or the Roman numeral format presented special difficulties
for Black examinees. Other common item types might be examined in
further studies.

Hypothesis 2.0 

Hypothesis 2.0 was concerned with the practice of manipulating
item difficulty by requiring less common word meanings or by adding or
deleting prefixes or suffixes. This practice was expected to have more
impact on the performance of Blacks than on that of Whites. Both
Hypotheses 2.1 and 2.2 were verbal hypotheses and were evaluated using
antonym items.

For Hypothesis 2.1, the stimulus word was the same for both
versions, with the item difficulty manipulated by the word provided as
the correct answer. In most of the items, all other options remained
the same. This hypothesis did not show the expected group-by-version
interaction, although this was probably due to the opposite effects of
two pairs of items revealed in the three-way interaction. The
manipulation of difficulty through the particular antonym chosen as the
correct response does appear to affect the two groups differently, but
results suggest that some elements other than vocabulary are also
contributing to the effect on performance. Careful study and analysis
of the items with opposing results may suggest what such elements might
be, but nothing is readily apparent as an explanation.

Hypothesis 2.2 was originally defined according to the presence or
absence of a suffix or prefix, with the (unstated) assumption that the
version without the suffix or prefix would be easier than the one with.
With the particular words chosen, however, the "with" version was
sometimes the more common usage. The results were found to be more
consistent if redefined in terms of common or uncommon rather than
present or absent. This revision did not change the major anomaly,
however, that the version that was easier for the White group was more
difficult for the Blacks for two items. These two items were those
identified by the bias analysis.

Hypothesis 2.2 was one of the four hypotheses that showed a strong
group-by-version effect in the log linear analyses. The larger
between-group differences, however, were associated with the more
common rather than the less common usage, and the magnitude of the
version effect was larger for Whites in two of the five items. A
genuine effect appears to have been detected, but the interpretation
suggested by the hypothesis as stated does not appear adequate to
explain the results. Because the prefix or suffix formed the antonym
of the original word, the options were also changed. Characteristics
of the distractors and key might be analyzed in more detail for other
sources of the observed effect, but again none is immediately obvious.

The results for Hypothesis 2.0 do not suggest that manipulations
of vocabulary of the type hypothesized are uniformly more detrimental
to the performance of Blacks than to that of Whites. Nonetheless,
differential effects from the manipulation of vocabulary in antonym
items were observed for reasons that are not yet apparent. These
reasons are likely to be subtle, however, since antonym items are so
spare, with very little information provided beyond the six words of
the stimulus and options. Nevertheless, these factors clearly have
some impact that can be observed in the results found here, and their
identification would be a worthy object of further research.

Hypothesis 3.0 

Hypothesis 3.0 considered the effect on performance of items where
the requirement was to select the one option that is false rather than
the more usual choice of the one that is true. This hypothesis was
evaluated in both the verbal and the analytical sections, Hypotheses
3.1 and 3.2 respectively.

For Hypothesis 3.1, the group-by-version effect was not
significant, although the effects of the manipulation were quite
different, as reflected in a highly significant three-way interaction.
Three different patterns of effects were shown, evidently cancelling
each other out in sum. The supplementary analyses suggested that the
reading passage interacted with the results in such a way that the
change between versions was larger for Whites on passage 1 and for
Blacks on passage 3. The first passage was a reading from physics, the
second from economics, and the third from biology. The biology
passage, however, concerned skin pigmentation and, although no
reference was made to humans, it may have been of greater interest to
Black examinees. One item from Hypothesis 3.1 was found to be biased.
It is interesting to note that in the "one-false-answer" version of
this item, the option that was the key on the alternate form appeared
to have had a stronger attraction for Blacks than for Whites,
suggesting that more Blacks than Whites may have misread the stem for
this item.

Hypothesis 3.2 was one of the four hypotheses with a strong
group-by-version interaction. The differences between groups were
smaller, however, for the "one-false-answer" version than for the
conventional version in four of the five item pairs, the opposite of
the expected effect.

The change from "one true answer" to "one false answer" affected
the performance of Blacks and Whites differently. Other factors must
also have been operating with these items, however. In the case of the
verbal items, the passage content appeared to be one such factor.
Further investigation of the effects of this item type is to be
encouraged.






Hypothesis 4.0 

Hypothesis 4.0 concerns the effects of inferences required of the
examinees as part of the task of the item. This was tested with verbal
items in Hypothesis 4.1 and with analytical items in Hypotheses 4.2 and
4.3.

Hypothesis 4.1 showed no interactions of interest in the log
linear analyses. The inference required by these items, however, was
not far removed from the passage, and an effect might have been
observed had the inferential leap required been larger.

Hypothesis 4.2 showed a marginally significant group-by-version
interaction. The change in version, however, was very slight,
consisting only of the insertion of the word "most" or "best" into the
item stem. The rest of the item was identical. The change had
essentially no effect at all for White examinees, but Blacks found the
version with "most" or "best" more difficult for four of the five
items.

Hypothesis 4.3 did not show any interactions of interest in the 
log linear analyses. 

In general, the support for Hypothesis 4.0 shown here is weak.
The particular choices of types of inference to investigate may not
have been the best choices, however, and the case here may best be left
as unproven, with a suggestion that other types of inference be
investigated. Differential effects were observed in the items for
analytical Hypothesis 4.2, however. Although the effect was small, the
item manipulation was well defined and probably deserves further
attention. Again, the importance of small differences in wording is
highlighted by these results. The fact that a change in a single word
produced an observable effect is impressive.

Hypothesis 5.0 

Hypothesis 5.0 is concerned with the effects of test-wiseness cues
on the performance of Blacks and Whites. It was evaluated with verbal
Hypothesis 5.1 and quantitative Hypothesis 5.2.

Hypothesis 5.1 was one of the four hypotheses to show a strong 
group-by-version effect with greater differences where test-wiseness 
cues were present. The results also showed the version with cues 
tended to be easier for Whites while the version without cues was 
easier for Blacks. Although the three-way interaction suggests that 
other factors are operating with this hypothesis, as with others, the 
observed effects are generally in line with expectations. 

Hypothesis 5.2 showed little difference between versions, but the
cues used were based on option elimination strategies (Kunti, 1982)
that may have been more sophisticated than most examinees actually use.






In verbal items, therefore, test wiseness does appear to affect
the performance of Blacks and Whites differently. For quantitative
items, this difference appears to be negligible. Either quantitative
items do not lend themselves as well to these strategies or the
strategies chosen to be evaluated are not the ones that matter. The
results for Hypothesis 6.2 discussed below, where key placement may
affect the usefulness of a guessing strategy, do suggest that simpler
strategies are being used, at least by Black examinees. This
hypothesis definitely warrants further elaboration and refinement,
probably with greater specificity in the particular test-wiseness
strategy expected to be employed.

Hypothesis 6.0 

Hypothesis 6.0 concerned the effect of key placement on
performance and was evaluated in two specific hypotheses: verbal
Hypothesis 6.1 and quantitative Hypothesis 6.2.

Hypothesis 6.1 showed no interactions of interest in the log
linear analyses. This may be because the analogy items used were so
difficult that even the strongest distractor did not offer the kind of
immediate pull required for a person to fail to evaluate the full set
of options. Further, recent investigation of changes in item
difficulty due to key placement found that effects were larger when the
key was moved more positions. For example, a change from position A to
E produced a larger effect than from A to B (Golub-Smith, 1984). The
shifts in this study were most often a single position, or only the
distractor position was changed and not the key. Further, the
strongest distractor and the key were usually in adjacent option
positions. Another factor was the definition of "strongest." Although
effects were not significant, somewhat different patterns resulted for
Blacks if the distractor was the most popular than if it drew mainly
the highest scoring people. Hence, these items may not have provided a
good evaluation of the hypothesis.

Hypothesis 6.2 did show a nonsignificant group-by-version
interaction, but the three-way interaction was significant in the log
linear analyses. Bias analyses found two biased items, one of which
appeared to be unrelated to the item manipulation. The other had
effects somewhat different from the other items; hence it may have
served to attenuate the group-by-version effect as well as contribute
to the three-way interaction. If Black examinees are finding the
material more difficult than Whites, however, this may just mean that
Blacks are more likely to be guessing at these items.

Hypothesis 6.0 was supported only in part. A differential effect
was observed for quantitative items, although Black examinees appeared
more often than White to be taking advantage of a guessing strategy of
selecting the center option. The results with the analogy items
suggest that if this effect occurs, it is with simpler items or where
the strongest distractor and the key are further removed in the list of
options. If this hypothesis were refined somewhat and explored
further, more interesting results might be obtained.






Hypothesis 7.0

Hypothesis 7.0 concerned an abstract or concrete dimension that
was evaluated in two quantitative hypotheses, one contrasting numbers
and symbols (7.1) and the other contrasting diagrams and verbal
descriptions (7.2).

The first of these, Hypothesis 7.1, showed a marginally
significant group-by-version effect but a very large three-way
interaction. Bias was also found in three of the five items. Two of
these also had large group-by-version effects in the log linear
analyses. In contrasting the three biased with the three unbiased
items, an obvious distinction emerged. The three unbiased items were
relatively straightforward and simply stated, although they were not
the easiest items in the set of six. The other three items were "story
problems," requiring the problem to be extracted from a verbal
description. In some sense, therefore, a different kind of abstract or
concrete dimension was laid down across the intended one. One of the
story problems also contained an obvious error that might be made if
the problem were not read carefully. This option drew both groups
strongly in the symbolic Version B, but drew Whites much less strongly
than Blacks in the numeric version.

Hypothesis 7.2 had no group-by-version effect, but again had a
significant three-way interaction. Both the bias analyses and the log
linear analyses indicated effects in opposite directions, however,
presumably cancelling each other out when effects were summed across
items. Results showed that for the quantitative comparison items, the
version without a diagram appeared to be biased and showed a larger
effect for Blacks; in the standard format, the item with the diagram
appeared to be biased and showed a larger effect for Whites. This is
an intriguing finding, suggesting the possibility of an interaction
with item format. The individual items should be examined in more
detail to see if other plausible explanations can be found.

The manipulations of the items making up Hypothesis 7.0 definitely
had a differential effect on the performance of Blacks and Whites. Of
the 15 items identified as biased by the item response theory
procedures, eight were items from this hypothesis. The interactions,
however, were striking, and the simple expectation stated in Hypothesis
7.0 seems inadequate to account for the results. Other important
sources of variation in the differences between groups are clearly at
work here. One possible source of such variation is an interaction
with the subject matter content. Geometry items, taken without regard
to hypothesis, are associated with larger between-group differences
than algebra or arithmetic items. However, Hypothesis 7.1 contains
only one geometry item and Hypothesis 7.2 only one item that is not
geometry, making it nearly impossible to separate the subject matter
content from the hypothesis except on the weight of a single item.
Format may make a difference, although this seems in conflict with the
negative result for Hypothesis 1.2, which looks directly at the format
effect. Possibly format differences occur only with certain types of
items. For example, the item that has the largest effect for
Hypothesis 1.2 is a geometry item. Possible effects of verbiage were
again observed with the items in Hypothesis 7.1. The relationship
between group performance and the abstract or concrete dimension in
quantitative items seems definitely to warrant further investigation.

Conclusions and Recommendations 

In this research, the most basic question was whether differential
performance of Blacks and Whites could be demonstrated through the
manipulation of relatively stable characteristics of test items. The
answer to this question appears to be "yes." For several of the
hypotheses the effects of the item manipulations were demonstrated to
be different for the two groups. Beyond that level of generality,
however, conclusions were less clear. Although the degree of support
for the seven general hypotheses varied, perhaps none was clearly and
unequivocally supported as stated. Nonetheless, the results are rich
with information. A number of directions for further investigation
have emerged, and more should follow from additional scrutiny of the
items for some of the hypotheses.

One issue raised by the results concerns the interpretation of the
effects. The hypotheses were stated in terms of how the manipulation
was expected to influence the performance of Blacks. The results,
however, showed a larger effect for Whites in several cases. The
question is why this should be. If the difference between groups
becomes larger because of the effect on the performance of Whites, the
suggestion is that something has been done to enhance that performance.
In Hypothesis 1.1 the difference between groups is larger for sentence
completion items than for antonyms. What cues are we providing in
these items that point to the correct response? Which of these are
intentional? Are all of these helpful to both groups or are some
relatively obscure to Black examinees? If the cues provided are not
equally helpful, would avoiding those cues found to be more difficult
for Blacks adversely affect the validity of the test for either group?

Similarly, items of the one-false-answer type have been identified 
as biased in previous studies more often than would be expected by 
chance. The results here (Hypothesis 3.0) supported a hypothesis of
differential effects on performance, yet there was a suggestion that in 
some cases cues were being provided that were not equally accessible to 
both Black and White examinees. The interaction with content also 
bears further exploration. These items and the passages to which they 
refer need to be examined in much greater detail to see if possible 
reasons for the difference in effect for different items can be 
determined. 

A related issue is that of test wiseness. The hypothesis specific
to test wiseness was Hypothesis 5.0, but the results for the key
placement Hypothesis 6.2 are also related to a possible test-wiseness
type of strategy. A possibility suggested by these results is that
Blacks and Whites differ, not in whether or not test-wiseness
strategies are employed, as is commonly assumed, but in which strategies
are employed. Further, some test-wiseness cues may be less accessible
to Black examinees so that the difficulty of the test-wiseness task is
not equivalent for both groups. Again, this should be a fruitful area
for further research.

Hypothesis 2.2 also raises some issues. Many might question
whether the differences observed as a result of the manipulations of
antonym items actually constitute bias. Vocabulary is, after all,
part of what the test purports to measure. Ultimately this is a
question of construct validity. Vocabulary is known to be associated
with academic performance and has traditionally been included in
academic aptitude tests. The use of a prefix or suffix to change an
item from a more familiar to a less familiar form, for example, or
other strategies such as making a verb from a word commonly used only
as a noun, may require a reasoning about words different from that
required in other antonym items. Worse, for some examinees the word
may be recognized readily in either form, while for other examinees the
meaning of the word must be inferred. If an inference must be made
more often by members of one group than by those of another, this
difference in process might arguably be a source of bias. The relevant
question may be whether the validity of the test would be harmed if
such items were not used.

Hypothesis 4.0 concerning inferences was clearly supported in only 
one of three specific hypotheses tested, yet the need for making 
inferences is implicit in the explanation of the results for the other 
hypotheses and specific items identified by the item response theory 
procedure as biased. Possibly the need to control the degree of
inference by changing the passage or specific wording of the stem
prevented the kind of inferences that would have shown a difference
from being used in this study. The challenge here may be in
controlling the degree of inference in a meaningful yet measurable way.

At the same time, Hypothesis 4.2 yielded the only result that
could be implemented immediately if that were desirable. The use of
the word "most" or "best" in the item stem was meant to be clarifying.
Instead it appeared to introduce a slight confusion for some Black
examinees, perhaps suggesting that something was wanted in these items
beyond what was immediately understood. This change, however, had
virtually no effect for White examinees. Deleting these superlatives
would not appear to harm most examinees and might remove a source of 
confusion for some. If the evidence seems too weak to make a change in 
practice on the basis of this study alone, it might at least be 
replicated in anticipation of such an action. 

A final recommendation for future study concerns the experimental
design. One of the more salient results of this study was the large
number of interactions. Rather than try to more carefully control a
single element of interest, two or more elements might be varied within
the same items in such a way that the effects could be separated
statistically. Such a separation was possible to some extent with the
reading passages, but the number of items per passage per hypothesis
should have been larger for more satisfactory analysis, and the
variations to be expected according to passage should have been
specified in advance. Such an approach would add to the complexities
of item preparation, but should greatly increase the usefulness of the
results. The quantitative area may be particularly suitable for such
an approach since the item elements are probably more easily definable.

What emerges clearly from this study is how little we know about
the mechanisms that produce differential performance between Blacks and
Whites. Still, the study has demonstrated that item elements exist
that are common to some number of items measuring different content
and that do affect differently the performance of Blacks and Whites.
Further investigation of these elements should be fruitful in
increasing our knowledge concerning the causes of bias and their
eventual remedy.






References

Golub-Smith, M. L. (April 1984). The effects of option scrambling on
listening comprehension items: An application of item response
theory. Paper presented at the annual meeting of the American
Educational Research Association, New Orleans.

Goodman, L. A. (1978). Analyzing qualitative/categorical data: Log
linear models and latent-structure analysis. Cambridge: Abt Books.

Hambleton, R. K., & Cook, L. L. (1977). Latent trait models and their
use in the analysis of educational test data. Journal of Educational
Measurement, 14, 75-96.

Kuntz, P. (1982). Test-wiseness cues in the options of mathematics
items. Paper presented at the annual meeting of the American
Educational Research Association, New York.

Lord, F. M. (1977). A study of item bias using item characteristic
curve theory. In Y. H. Poortinga (Ed.), Basic problems in
cross-cultural psychology. Amsterdam: Swets and Zeitlinger.

Scheuneman, J. D. (1979). Academy of Certified Social Workers:
Report on minority performance. Unpublished research report.

Scheuneman, J. D. (1980). Latent trait theory and item bias. In L.
J. Th. van der Kamp, W. P. Langerak, & D. N. M. de Gruijter (Eds.),
Psychometrics for educational debates. London: John Wiley & Sons.

Scheuneman, J. D. (1981). A new look at bias in aptitude tests. In
P. Merrifield (Ed.), Measuring human abilities (New Directions in
Testing and Measurement, No. 12). San Francisco: Jossey-Bass.

Scheuneman, J. D. (1982). A posteriori analyses of biased items. In
R. A. Berk (Ed.), Handbook of methods for detecting test bias.
Baltimore: Johns Hopkins University Press.

Scheuneman, J. D. (1984). A theoretical framework for the exploration
of causes and effects of bias in testing. Educational Psychology, 19,
219-225.

Smith, J. K. (1982). Converging on correct answers: A peculiarity of
multiple choice items. Journal of Educational Measurement, 19,
211-220.

Thissen, D. (1982a). CULT: Compleat univariate latent trait
program. Unpublished manuscript.

Thissen, D. (1982b). Marginal maximum likelihood estimation for the
one-parameter logistic model. Psychometrika, 47, 175-186.

Thissen, D., Steinberg, L., & Wainer, H. (1983). On the measurement
of item bias: A statistically rigorous methodology using item
response theory. Unpublished manuscript.



Appendix A 
Log Linear Analyses 

Logit models are a class of log linear models that are analogous to
multiple regression analyses. In general, regression models are additive
models in which the weighted effects are summed to obtain a predicted value of
the dependent variable, which is typically a continuous variable. Log linear
models are multiplicative; that is, the predicted value is the product of the
weighted effects or the sum of the logarithms of the effects. In the logit
models, the dependent variable is the odds of a dichotomous variable. In this
study, the odds of making a correct response to an item are predicted from
group membership of the examinees, the particular item pair, and the two item
versions that make up the pair. The advantages of logit models over standard
multiple regression procedures for the analysis of categorical data are
discussed in Goodman (1978).
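
As a sketch of the model class just described, the full (saturated) logit model for this design might be written as follows. The lambda notation and the subscripts g (group), v (item version), and i (item pair) are assumptions introduced here for illustration; they follow common log linear usage and are not notation taken from this report.

```latex
\ln \frac{P_{gvi}}{1 - P_{gvi}}
  = \lambda
  + \lambda^{G}_{g} + \lambda^{V}_{v} + \lambda^{I}_{i}
  + \lambda^{GV}_{gv} + \lambda^{GI}_{gi} + \lambda^{VI}_{vi}
  + \lambda^{GVI}_{gvi}
```

Here P_{gvi} is the probability of a correct response for group g on version v of item pair i, so the left side is the log of the odds. Each reduced model drops one or more terms from the right side.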

For each set of items corresponding to the 16 specific hypotheses, a
series of analyses was performed in which successive models were fitted to
the data. To get some sense of the meaning of the different effects included
in the models, consider the following example for one item. (For ease of
conceptualization, percent of correct responses will be used in this example
rather than the odds, that is, the number correct divided by the number
incorrect, actually used in the computation.)





                 Black     White

   Version A      p11       p12       p1.

   Version B      p21       p22       p2.

                  p.1       p.2




The objective is to predict the performance of Blacks and Whites. If
group membership has no effect on performance, the four cell values, pij,
will be predictable from only the information on performance on the two
versions, p1. and p2., the difficulty of each version combining the two
groups. Similarly, if the item version has no effect, the cell values, pij,
will be predictable from only the performance of the two groups combining data
from the two versions, p.1 and p.2. If both group and version are important
and the effect of the item version is the same for the two groups, the cell
values will be adequately predicted from the four "marginal" values, p1., p2.,
p.1, and p.2. If, however, this model with two main effects (group and
version) does not fit the data, it must be assumed that an interaction
between group and version exists; that is, the effect of the item versions is
not the same for the two groups. Notice that the model that includes the
interaction predicts the cell values from the pij, which are the cell values.
This is called the "saturated model", which has no degrees of freedom since
the expected values in all cells will equal the obtained values and no
possibility of variation exists.
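
The idea of a group-by-version interaction can be made concrete with a small sketch on the log-odds scale, which is the scale the logit models work on. The function names and all proportions below are hypothetical and chosen only for illustration; they are not data from this study.

```python
import math


def logit(p):
    """Log odds of a proportion p, where 0 < p < 1."""
    return math.log(p / (1.0 - p))


def group_by_version_contrast(p11, p12, p21, p22):
    """Interaction contrast on the log-odds scale for the 2 x 2 example.

    First subscript = version (1 = A, 2 = B); second subscript = group
    (1 = Black, 2 = White), as in the example table.
    A value of zero means the version effect is the same in both groups.
    """
    version_effect_black = logit(p11) - logit(p21)
    version_effect_white = logit(p12) - logit(p22)
    return version_effect_black - version_effect_white


# Hypothetical proportions correct, invented for illustration only.
no_interaction = group_by_version_contrast(0.50, 0.75, 0.25, 0.50)
interaction = group_by_version_contrast(0.50, 0.75, 0.40, 0.50)
```

In the first call the version effect is identical for both groups in log-odds terms, so the two-main-effects model would reproduce the cells; in the second it is not, and only a model with the interaction term would.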

This simple example can be extended to the case where there are several
items. Here, if no effect exists for different item pairs, the cell values
can be predicted from the item marginals, that is, from the mean performance
across all the items of one version for one group. This would mean that there
was no item main effect. Likewise, the difference between versions may be the
same for all item pairs, or the group performance differences may remain the
same for the set of items. In these instances, there would be no item-by-
version or group-by-item interaction, respectively.




In this study, the interactions of interest are the two-way, group-by-
version interaction and the three-way, group-by-version-by-item interaction.
The two-way interaction indicates that the two item versions are
differentially difficult; that is, the difference in difficulty between
versions is larger for one group than for the other. Hence knowing the
difference in difficulty for Versions A and B and the difference in
performance for Blacks and Whites is not sufficient to predict the difficulty
of each version for each group. The three-way interaction further specifies
that these differences between groups and versions are not the same for the
different item pairs in a set. If the three-way interaction occurs without a
two-way interaction, one of two explanations is most likely. Either, one, the
differential effect occurs for only some items or, two, effects are in
different directions, that is, effects are larger for Black examinees for some
items and for White examinees for others. (See Table 7 on page 30.)

The models analyzed in this study are shown in Table 11.
The saturated model is M0, which includes all effects: the three main effects,
the three two-way interactions, and the three-way interaction. Model M1
includes only the main effects. If this model fits the data, the various
interactions are not required. Models M2, M3, and M4 evaluate the effect of
each of the separate main effects. The difference between the chi square
statistic for model M1 and that for one of these models tests the
significance of the effect not specified. For example, if the fit of M1 is
significantly better than that of M2, the effect of the item pair (the effect
not specified) provides a significant improvement in the prediction of
performance. In models M5, M6, and M7, the separate interactions are evaluated.




Table 11
Log Linear Models Estimated

   Model     Effects Included*

   M0        All (G, V, I, GV, GI, VI, GVI)
   M1        Main effects (G, V, I)
   M2        G, V
   M3        G, I
   M4        V, I
   M5        G, V, I, GV
   M6        G, V, I, GI
   M7        G, V, I, VI
   M8        G, V, I, GV, GI, VI

   * G     group main effect
     V     item version main effect
     I     item pair main effect
     GV    group-by-version interaction
     GI    group-by-item interaction
     VI    item-by-version interaction
     GVI   three-way interaction




Again, the significance of the interaction specified is tested by taking the
difference between the result for one of these models and that for model
M1. The final model, M8, tests for the three-way interaction. If this model
does not fit the data, the three main effects and the three two-way
interactions together are inadequate to explain the results. Hence, the
three-way interaction, a different effect of item version for the two groups
in different item pairs, is significant. A more formal and detailed
presentation of the log linear logit models is also provided by Goodman (1978).

The data were analyzed using the SPSS-X log linear program. The results
are given in Table 12. The first column gives the reference value chi square
for model M1, where the chi square value is the likelihood ratio chi square.
The last column, for the three-way interaction, is the result obtained for
model M8. The other results are the differences between the result for the
relevant model and the reference value for M1, with the degrees of freedom for
the difference. Unless otherwise indicated, the probability of the obtained
chi square is less than .01.
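
The difference tests described here can be summarized in a short sketch. The symbols O and E (observed and expected cell frequencies) and the index k are notation assumed for illustration; the report itself states the procedure only in words.

```latex
G^2 = 2 \sum_{\text{cells}} O \,\ln\frac{O}{E}

% Main effects: a model omitting one effect is compared with M_1
\Delta G^2_k = G^2(M_k) - G^2(M_1), \qquad
\Delta df_k = df(M_k) - df(M_1), \qquad k = 2, 3, 4

% Two-way interactions: a model adding one interaction is compared with M_1
\Delta G^2_k = G^2(M_1) - G^2(M_k), \qquad
\Delta df_k = df(M_1) - df(M_k), \qquad k = 5, 6, 7
```

Each difference is referred to the chi square distribution with the corresponding difference in degrees of freedom.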

For none of the hypotheses did the model consisting of the three main
effects only (model M1) fit the data. For three hypotheses, 1.2, 4.2, and 6.1,
however, the version main effect was not significant. For Hypothesis 1.2,
significant interactions of the item version with the item pair suggested
that, while on the average the item version had no effect, the effect was
different and apparently opposite for different item pairs. Hypothesis 4.2
showed a marginally significant group-by-version interaction.

For three of the hypotheses, one of the models with a single interaction
was found to fit the data. For Hypotheses 6.1 and 4.2, the model consisting
of main effects and the group-by-item interaction (model M6) was appropriate.



Table 12
Results of Log Linear Analyses
All Effects

              All Main         Individual                 Two-Way                      Three-Way
              Effects          Main Effects               Interactions                 Interaction
Hypothesis    Chi sq.    df    Effect  Chi sq.   df  p    Effect  Chi sq.   df  p      Chi sq.   df

1.1           [illeg.]   13    I       [illeg.]   4       GV      [illeg.]   1         [illeg.]   4
                               V       [illeg.]   1       GI        25.8     4
                               G        460       1       VI      2152.4     4

1.2           [illeg.]   16    I       [illeg.]   5       GV      [illeg.]   1  ns     [illeg.]   5
                               V          0.07    1  ns   GI        77.9     5
                               G        743       1       VI       126.2     5

1.3            172.7     13    I       3442       4       GV         0.1     1  ns        3.4     4
                               V          4.4     1       GI        46.9     4
                               G        322       1       VI       126.2     4

2.1           1036.8     19    I       3898       6       GV         3.5     1  *        28.9     6
                               V       1369       1       GI        52.9     6
                               G        828       1       VI       968.6     6

2.2            103.9     13    I       1657       4       GV        12.8     1            8.0     4
                               V        185       1       GI        38.8     4
                               G        372       1       VI        40.7     4

3.1           [illeg.]   13    I        958       4       GV         0.5     1  ns       14.8     4
                               V       [illeg.]   1       GI        44.5     4
                               G        430       1       VI       367.0     4

3.2            117.3     13    I       1006       4       GV         8.3     1            6.8     4
                               V       [illeg.]   1       GI         9.3     4  *
                               G        323       1       VI        93.7     4

4.1             52.6     13    I       [illeg.]   4       GV         1.2     1  ns        3.8     4
                               V       [illeg.]   1       GI         6.1     4  ns
                               G        525       1       VI        41.2     4

4.2             22.5     13 ** I       1347       4       GV         3.3     1  *         1.7     4
                               V       [illeg.]   1  ns   GI        17.4     4
                               G        506       1       VI         0.1     4  ns

4.3            113.1     13    I       5840       4       GV         1.4     1  ns        3.9     4
                               V         67       1       GI        68.3     4
                               G        327       1       VI        39.7     4

5.1            144.8     13    I       2967       4       GV         8.7     1           14.0     4
                               V          9.1     1       GI        21.2     4
                               G        302       1       VI       103.3     4

5.2             95.4     16    I       2880       5       GV         0.9     1  ns        0.9     5
                               V         21       1       GI        57.8     5
                               G        952       1       VI        36.8     5

6.1            124.8     28    I       1252       9       GV         0.7     1  ns        8.3     9
                               V          2.4     1  ns   GI       106.4     9
                               G        772       1       VI         9.4     9  ns

6.2            102.3     16    I       1996       5       GV         2.5     1  ns       12.4     5
                               V         18       1       GI        55.0     5
                               G        815       1       VI        30.5     5

7.1            321.7     16    I        841       5       GV         2.8     1  *        27.3     5
                               V       1747       1       GI         7.4     5  ns
                               G        841       1       VI       274.5     5

7.2            374.5     16    I       1347       5       GV         0.9     1  ns       11.4     5
                               V         98       1       GI        49.1     5
                               G        957       1       VI       307.3     5

*  p < .10
** p < .05
ns p > .10
All other effects are significant beyond the .01 level.
[illeg.] marks an entry that is not legible in the source copy.

The chi square values were 18.4 with 19 degrees of freedom and 5.1 with 9
degrees of freedom for the two hypotheses respectively. (The results for
Hypothesis 4.2 are discussed further in the body of the paper.) For
Hypothesis 4.1, none of the interactions with group membership was
significant, so that model M7, with the three main effects and the
item-by-version interaction, was found to fit the data. The chi square value
for this model was 11.3 with 9 degrees of freedom.
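
As a rough check on these fit statistics, the upper-tail probability of each reported chi square can be computed directly. The sketch below uses only the Python standard library; the function `chi2_sf` and the dictionary names are hypothetical helpers written for this illustration, and the .10 cutoff for calling a model a fit is an assumption borrowed from the "ns" convention in Table 12, not a rule stated by the author.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable.

    Uses the closed-form finite sums available for integer df,
    so no third-party library is required.
    """
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite correction sum whose
    # j-th term is x^(j-1) / (1 * 3 * ... * (2j - 1)).
    term, total = 1.0, 0.0
    for j in range(1, (df - 1) // 2 + 1):
        if j > 1:
            term *= x / (2 * j - 1)
        total += term
    tail = math.sqrt(2.0 * x / math.pi) * math.exp(-half) * total
    return math.erfc(math.sqrt(half)) + tail


# (chi square, df) pairs quoted in the text for the single-interaction models.
reported = {"6.1": (18.4, 19), "4.2": (5.1, 9), "4.1": (11.3, 9)}
p_values = {hyp: chi2_sf(x, df) for hyp, (x, df) in reported.items()}

# Assumed convention: treat p > .10 as an adequate fit.
fits = {hyp: p > 0.10 for hyp, p in p_values.items()}
```

Under that assumed cutoff, all three reported values are consistent with the statement that the corresponding model fits the data.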

Three hypotheses, 1.3, 4.3, and 5.2, were found to have significant
group-by-item and item-by-version interactions. Although a model with these
effects was not specifically evaluated, the nonsignificant group-by-version
interaction and three-way interaction suggest such a model would fit the data.

The remaining seven hypotheses show either the expected group-by-version
interaction or the three-way group-by-version-by-item interaction. The
results for these hypotheses are discussed in more detail in the body of the
paper.






APPENDIX B 
Item Bias Analyses

In this method, an unbiased item is defined as one where all examinees of
a given level of ability have the same probability of a correct response
regardless of their group membership; or, in the terms of item response
theory, an unbiased item is one for which the item characteristic curves for
two groups, and hence all three parameters that define those curves, are the
same.

Basic Concepts of Item Response Theory 

The item characteristic curve represents the probability of a correct
response as a function of a unidimensional ability, θ. The curve is an ogive,
or S-shaped curve, that begins as a line along or parallel to the axis on the
left and rises to a value of 1 on the right where it again becomes parallel to
the axis. The dimension from left to right represents increasing levels of
ability; the height of the curve represents the probability of a correct
response. At any given ability level, that is, at a given point along the
axis, the height of the curve at that point is the probability that a person
with that ability will get the item correct. Examples of item characteristic
curves are shown in Figures 1-6.

The exact shape of the item characteristic curve is mathematically defined
by a logistic function with three parameters: a, b, and c. The b parameter
is the inflection point of the curve and represents the difficulty of the
item. If the lower asymptote (the portion of the curve on the left before it
begins to rise) is zero, b will be the ability level where the probability of
a correct response is .5. The a parameter is proportional to the slope of the
curve at that point and represents the discrimination of the item. The third
parameter, c, is the height of the lower asymptote and represents the
probability of a correct response essentially in the absence of knowledge. For
multiple choice items this asymptote is usually assumed to be non-zero and is
often referred to as a guessing parameter. (For additional discussion of the
basics of item response theory, see Hambleton & Cook, 1977.)
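
For reference, one common way of writing the three-parameter logistic curve is sketched below. Whether the program used in this study included an additional scaling constant (such as 1.7) in the exponent is not stated in the report, so its absence here is an assumption.

```latex
P(\theta) = c + \frac{1 - c}{1 + e^{-a(\theta - b)}}

% Value and slope at the inflection point \theta = b:
P(b) = \frac{1 + c}{2}, \qquad
\left.\frac{dP}{d\theta}\right|_{\theta = b} = \frac{a\,(1 - c)}{4}
```

With c = 0 the first expression gives P(b) = .5, and the second shows that the slope at b is proportional to a, in line with the verbal description above.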

The Method 

Item response theory (IRT) has been seen as particularly useful for
studies of item bias because, in theory, the item characteristic curve is not
dependent on the distribution of ability in the sample used to determine the
parameters. Hence, if the parameters of an item characteristic curve are
estimated separately for two samples drawn from the same population, the
resultant curves should be the same (except for scale) even though the samples
may differ in the distribution of abilities within them. If the curves are
not the same, the conclusion of bias may be warranted. In practice, however,
parameters may be more difficult to estimate in some samples than in others,
and the estimation itself is not perfectly precise. Hence, some indicator of
the degree of agreement between the two curves is needed. None of the
indicators suggested in the literature to date has been found fully
satisfactory. Perhaps more important, however, is the requirement for very
large sample sizes in order to estimate the parameters well. In many
applications, too few minority examinees are available.

A new method developed by Thissen for using item response theory provides
a different approach that resolves some of these problems. The method is
based on a marginal maximum likelihood estimation procedure (Thissen, 1982b).
Its major difference from other procedures is the ability to impose equality
constraints and to evaluate the significance of the difference between the fit
of the model to the data with and without these constraints. The method is
implemented through use of the CULT computer program (Thissen, 1982a).

In this study, the first step in the procedure for the analysis of a given
hypothesis was to obtain ability estimates for all examinees. This was done
on a selected set of items from the operational sections of the test as well
as the experimental items, constraining the item parameters to be the same for
both groups. In subsequent analyses, the items from the operational test were
always assumed to have the same parameters, providing an anchor test for the
analyses. The data were then refit, releasing certain of the equality
constraints in a systematic fashion. At each step of the process, the fit was
evaluated with a negative log likelihood statistic (G²). Twice the difference
between the G² for two successive analyses was examined as a test of
significance. The degrees of freedom for this statistic were the difference
between the numbers of parameters estimated in the more and less constrained
models. If the difference in G² is not significant, the unconstrained
parameters cannot be assumed to be different from one another (Thissen,
Steinberg, & Wainer, 1983).
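
The test just described can be written compactly as follows. This is a sketch in assumed notation: the subscripts and the parameter counts q are introduced here for illustration, and G² is taken, as in the text, to be the negative log likelihood of a fitted model.

```latex
\chi^2_{\text{diff}}
  = 2\,\bigl[\,G^2_{\text{constrained}} - G^2_{\text{released}}\,\bigr],
\qquad
df = q_{\text{released}} - q_{\text{constrained}}
```

Here q counts the parameters estimated in each model; releasing an equality constraint on one parameter for two groups adds one to q. A nonsignificant value means the data do not require separate parameters for the two groups.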



Item Bias Results 



Because of the cost and time requirements of the IRT analyses, these
procedures were applied only for items from those hypotheses where preliminary
results suggested outliers might be detected. The item bias analyses were
performed for the items in verbal Hypotheses 1.1, 2.2, 3.1, and 5.1;
quantitative Hypotheses 7.1, 7.2, and 6.2; and analytic Hypothesis 3.2.



In these analyses, ability estimates for examinees were developed from 
items in the operational sections, which were taken by all examinees. Only 
items of the same type as the experimental items were used for this purpose in 
order to increase the likelihood that a single ability was being measured. 
That is, for Hypotheses 1.1 and 2.2, only antonym items were used to estimate 
the ability parameters; only reading comprehension items were used for Hypoth- 
eses 3.1 and 5.1. With the math items, parameters for quantitative comparison 
items were estimated separately from the items of the more conventional type. 
For Hypotheses 7.1 and 7.2, both formats had been used so two separate sets of 
analyses were performed within each of the hypotheses based on the different 
ability estimates. Hypothesis 6.2 items were evaluated in terms of ability 
based only on the conventional item type. 



After the analysis of Hypothesis 1.1, it became apparent that the cost
could be greatly reduced by using a smaller sample of Whites. Therefore, a
series of runs was performed to determine how far the sample size could be
reduced without undue loss of power. The final sample for the remaining
hypotheses consisted of a one-third spaced sample of the Whites used in the
analyses described earlier. This final sample was approximately twice the
size of the Black sample.
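
A spaced sample of this kind could be drawn as in the sketch below. The reading of "one-third spaced sample" as "every third case in file order" is an assumption, and the function name and the examinee identifiers are hypothetical.

```python
def spaced_sample(records, step=3, start=0):
    """Keep every `step`-th record, beginning at index `start`."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    return records[start::step]


# Hypothetical examinee identifiers, for illustration only.
white_ids = list(range(1, 31))
subsample = spaced_sample(white_ids, step=3)
```

Taking cases at a fixed spacing keeps the subsample spread evenly through the file, which matters if the file order is related to administration date or test center.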

For Hypothesis 1.1, the c parameters were estimated first, assuming the
value of the parameter would be the same for Blacks and Whites for all five
items of a given item version. The value of the c parameter was, however,
allowed to be different for the antonym version and the sentence completion
version. Acceptance of the same c parameter value for all items in a set only
means that insufficient data exist to distinguish different c's. Estimating a
common value helps stabilize the estimate of other parameters and facilitates
interpretation. The a and b parameters were then estimated for each item.
The data were found to be consistent with equal parameters for Blacks and
Whites for all of the antonym items and for four of the five sentence
completion items. For the remaining item, the difficulty parameters were
found to be significantly different, with the item more difficult for Blacks.
The verbal items that showed parameter differences are summarized in Table 9,
appearing in the body of this paper.

Hypothesis 2.2 was found to be somewhat more complex. For these items the
hypothesis of equal c parameters for Blacks and Whites could not be confirmed
for one form of the test. The five items were estimated to have a c of .25
for Whites and .15 for Blacks. For relatively easy items, however, there was
little data available in the ability regions where the asymptotes occur. That
is, for relatively easy items, even the least able examinees were responding
at a level well above chance. This hypothesis of equal c's was only apt to be
rejected, therefore, where items were sufficiently difficult for data to be
available in this range. The observed difference in c parameters was thus
probably the result of the most difficult item, 17, and perhaps item 15. Item
17, however, also showed differences in both the a and b parameters in both
item versions. In Version B (more common meaning), this item favored Blacks,
but in the alternate version, the item characteristic curves crossed; hence,
the relative performance of the two groups would be dependent on their
respective abilities. The item characteristic curves for the two versions of
this item are shown in Figure 1.
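
Why crossing curves make the favored group depend on ability can be seen in a minimal sketch. The parameter values below are hypothetical, chosen so that the two curves share a difficulty and a lower asymptote but differ in slope; they are not the estimates obtained for item 17, and the logistic form without a scaling constant is an assumption.

```python
import math


def icc(theta, a, b, c):
    """Three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical parameters for one item estimated separately in two groups.
group_1 = {"a": 0.6, "b": 0.0, "c": 0.2}
group_2 = {"a": 1.4, "b": 0.0, "c": 0.2}


def favored(theta):
    """Return which hypothetical group has the higher probability at theta."""
    p1 = icc(theta, **group_1)
    p2 = icc(theta, **group_2)
    if abs(p1 - p2) < 1e-12:
        return "neither"
    return "group_1" if p1 > p2 else "group_2"


low, mid, high = favored(-2.0), favored(0.0), favored(2.0)
```

Below the crossing point the flatter curve lies higher, above it the steeper curve does, so no single statement about which group the item favors holds across the ability range.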

The Items for Hypothesis 3.1 again appeared to be unbiased except for Item 

18. The version with one false answer (Version A) was clearly biased In favor 
of Whites, with a moderate difference In difficulty and a small difference In 
slope. The other version did not differ In difficulty, but had a large dlf- 

ERIC 66 



Figure 1 
Item Characteristic Curves 
Hypothesis 2.2 





HYPOTHESIS 2.2 




VEKSION^A ITtM=i7 


1,0- 

: 








B 0.8- 


/ / 


0 


/ / 


B 


/ / 


R 


/ / 


B 


/ / 
/ / 


I 0, 6- 




L ^ ■ 


/ / 


I 




T 


/ / 


Y 


/ / 


0,4- 


/ / 


C 
0 


/ / 


n 




B 





E 0-2- 




C 




T 




0. 0- 


\ 1 1 i TTprt '^■^T. .| 1 




3-2-10123 




THETfl 


LEGENO : 


GHOUP BLfiCKS WHITES 



er|c 





HYPOTHESIS 2.2 

1 1 1 1 III k W l«^ mm*m^ 




VERSIOK=B ITEM=17 


1,0^ 








• 

p 


/ / 
/ / 


B 0,8- 


/ / 


(3 


f / 


B 


/ / 


fl 
B 


f / 
/ J 
/ 1 


T n fi-^ 

i U, w 


f 1 


L : 


/ / 


T 


/ / 


T 


/ / 


r 


/ / 


0,4- 


/ / 
/ / 


c 


/ / 


fl 


/ / 


B 




B 




E 0,2- 




C 




T \ 




0,0" 






3-2-10123 




THETfl 


LEGEND ; 


GROUP BLRCKS WHITES 



68 



-66- 

ference In slope, so that again the group favored changed with the level of 
ability. The item characterisric curves for the two versions of this item ere 
shown ii^ Figure 2. 

For Hypothesis 5.1, none of the items showed significant differences in
the a and b parameters. However, the lower asymptotes were different for both
forms of the test. In this case, only item 25 was nearly difficult enough to
be producing this result. In one version (test-wiseness cues absent), a small
but significant difference existed that favored the Black group; in the other,
a slightly larger difference favored the Whites. The change from one item to
the other was larger for Blacks. It should be noted, however, that the
significance of differences between versions was not tested, and that the .23
value of c for Blacks in Version A was equal to the obtained percent correct
for that group.

For quantitative Hypothesis 6.2, c's were not constrained to be equal for
all items of a given version but were estimated separately for each item.
Only two items showed evidence of bias, Version B (key not C) of item 13 and
both versions of item 28. Item 13 appeared to be both more difficult and more
discriminating and had a higher asymptote for Whites. The c parameter for
Whites was .63, however, an extremely high value for which no reasonable
explanation is apparent in light of the other data. The three parameters for
the Black group in Version B were very similar to those for both groups in the
unbiased Version A, where a was .6, b was -.5, and c, .27, suggesting that the
change in option placement largely affected the performance of the White
examinees. Parameter values for the Version B item are shown in Table 10 in
the body of the paper. Both versions of item 28 were both more difficult and
more discriminating for Blacks. Version B had an asymptote slightly higher
for Blacks. Again the obtained parameter values are given in Table 10. Item
characteristic curves for these items are shown in Figure 3.
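
To give a sense of how unusual an asymptote of .63 is, the sketch below
evaluates a three-parameter logistic curve at the Version A values reported
above (a = .6, b = -.5, c = .27) and again with c raised to .63. It is a
simplification: a and b are held fixed, although the text reports that they
also differed for Whites on Version B, and the scaling constant D = 1.7 is
assumed rather than taken from the report.

```python
import math

D = 1.7  # scaling constant; an assumption, not given in the text


def icc(theta, a, b, c):
    """Three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


A, B = 0.6, -0.5      # Version A values reported in the text
C_REPORTED = 0.27     # Version A asymptote reported in the text
C_HIGH = 0.63         # asymptote reported for Whites on Version B

# Holding a and b fixed is an illustrative simplification only.
for theta in (-3, -2, -1, 0, 1, 2, 3):
    low = icc(theta, A, B, C_REPORTED)
    high = icc(theta, A, B, C_HIGH)
    print(f"theta={theta:+d}  c=.27: {low:.2f}  c=.63: {high:.2f}  "
          f"gap: {high - low:.2f}")
```

Under these assumptions the gap between the two curves is widest at low
ability and narrows as theta increases, since the asymptote mainly governs the
probability of success for examinees who cannot solve the item.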

Figure 2
Item Characteristic Curves
Hypothesis 3.1

[Plots not reproduced. Panels: Version A, Item 18; Version B, Item 18.
Horizontal axis: theta (-3 to 3). Legend: group (Blacks, Whites).]

Figure 3
Item Characteristic Curves
Hypothesis 6.2

[Plots not reproduced. Panels: Version B, Item 13; Version A, Item 28;
Version B, Item 28. Horizontal axis: theta (-3 to 3). Legend: group (Blacks,
Whites).]



The two quantitative comparison items for Hypothesis 7.1 (items 1 and 4)
were analyzed separately from the other four. No evidence of bias was seen
here. For the other four items, the hypothesis that the parameters were the
same for both groups was rejected within both item versions. Further, item 17
in Version A (numbers) could not be shown to have the same c parameters as the
other items in that form. The asymptotes were lower for both groups than for
the other items of the same version, though higher for Whites and lower for
Blacks than in the alternate item 17. It also showed small differences in
discrimination.

The other items showed a higher asymptote for Blacks than for Whites, with
much higher c's in Version A (with numbers) than in Version B (with symbols).
It is not entirely clear which items produced sufficient data to reject the
hypothesis of equal c's, but the most likely candidates appear to be item 27
in Version A and item 17 in Version B, neither of which showed bias in the
other parameters. Item 19 showed differences in both the a and b parameters
in both versions, with the item both more difficult and more discriminating for
Blacks. Parameter values for the biased items are given in Table 10. Item
characteristic curves for item 17 Version A and item 19 Versions A and B are
shown in Figure 4.

Figure 4
Item Characteristic Curves
Hypothesis 7.1

[Plots not reproduced. Panels: Version A, Item 17; Version A, Item 19;
Version B, Item 19. Vertical axis: probability correct. Horizontal axis:
theta (-3 to 3). Legend: group (Blacks, Whites).]

Four of the six items for Hypothesis 7.2 (items 2, 3, 5, and 6) were
quantitative comparisons and were analyzed together. For Version A (with
diagrams), all four items were found to be unbiased. For Version B (with
verbal description), the c parameters were estimated for the four items
together and were found to be higher for Blacks (.14) than for Whites (.01),
in contrast with a c of .16 for both groups on Version A. The items
responsible for this result seem most likely to have been item 2 and/or item
6. Item 2 did not show differences on the other parameters, but item 6 was
again both more difficult and more discriminating for Blacks. This result was
also found for item 3. Item characteristic curves for all four Version B
items are shown in Figure 5.
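
The size of an effect that is confined to the lower asymptote can be made
explicit. As a sketch, assume the three-parameter logistic form and assume that
the two groups share the same a and b (as reported for item 2, though not for
items 3 and 6). The symbol Psi for the logistic term is notation introduced
here for illustration, with any scaling constant absorbed into a:

```latex
P_g(\theta) = c_g + (1 - c_g)\,\Psi(\theta), \qquad
\Psi(\theta) = \frac{1}{1 + e^{-a(\theta - b)}}

P_B(\theta) - P_W(\theta) = (c_B - c_W)\,\bigl(1 - \Psi(\theta)\bigr)
```

With the Version B estimates of .14 for Blacks and .01 for Whites, the
difference under these assumptions is at most .13, approached only at very low
ability, and it shrinks toward zero as theta increases.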

For the two remaining items in Hypothesis 7.2, the c parameters were
estimated separately within each version. The Version B (description) items
used a common c estimate for Whites and Blacks and appeared to be otherwise
unbiased. In Version A, item 15 was both more difficult and more
discriminating for Whites and also had a much higher asymptote for Whites,
.40, in contrast to .16 for Blacks and .21 for the B version. Item 23 was
more difficult and discriminating for Blacks with a somewhat higher asymptote.
The item characteristic curves for these items are shown in Figure 6. The
parameter values for all biased items for Hypothesis 7.2 are given in Table 10.

Analyses for the analytic Hypothesis 3.2 were based on ability estimates
using all items from the operational analytic sections, since item types used
were quite similar for all items. No significant differences between
parameters for these items were found.



Figure 5
Item Characteristic Curves
Hypothesis 7.2
Quantitative Comparisons

[Plots not reproduced. Panels: Version B, Item 2; Version B, Item 3;
Version B, Item 5; Version B, Item 6. Horizontal axis: theta (-3 to 3).
Legend: group (Blacks, Whites).]

Figure 6
Item Characteristic Curves
Hypothesis 7.2
Standard Math Items

[Plots not reproduced. Panels: Version A, Item 15; Version A, Item 23.
Vertical axis: probability correct (0.0 to 1.0). Horizontal axis: theta.
Legend: group (Blacks, Whites).]



